One very interesting thing about automatic differentiation is that you can think...

kristjansson · on March 23, 2024

Dual numbers implement forward mode automatic differentiation, but is there additional value to viewing AD in terms of duals e.g. when we’re implementing reverse mode (backprop)?

naasking · on March 23, 2024

Depends what you mean by "additional value". Dual numbers are very simple, and enough for reverse mode AD too though, even via a purely functional implementation:

Provably Correct, Asymptotically Efficient, Higher-Order Reverse-Mode Automatic Differentiation, https://dl.acm.org/doi/pdf/10.1145/3498710

MikeBattaglia · on March 23, 2024

This is a good question that doesn't have a short answer. There are some different philosophical opinions about this.

One way to look at this is to note that even with forward mode autodiff, there have generally historically been two different viewpoints for the whole thing, which I'll call the "computer science" view and the "algebraic" view.

The computer science view involves things that look computational graphs with nodes that look like {"value": 123, "deriv": 456}. We are storing real values along with propagated partial derivatives. We have a custom * operator which sets a * b = {"value": a.value * b.value, "deriv": a.value * b.deriv + b.value * a.deriv}. Other functions, like exp, sin, cos, log, etc are also extended to handle these kinds of input. There's a ton of literature that views things using this kind of framework going back to the 1960s.

The algebraic view uses dual numbers. Instead of writing {"value": 123, "deriv": 456}, we write 123 + 456h. We get the same results as the above: (a + bh) * (c + dh) = ac + (ad + bc)h. We can extend many functions in a natural way to the dual plane, such as exp, sin, cos, log, and get values there. There's also plenty of literature on these, going back to the late 1800s.

A modern view is to note that these two things are *identical*. It isn't that forward mode autodiff "uses" dual numbers; it is dual numbers. The set of elements of the form {"value": x, "deriv": y}, with addition and multiplication as stated, satisfies the axioms of a real algebra and is isomorphic to the dual numbers. We could have written {"real_part": x, "dual_part": y} if we wanted. You can see this viewpoint in some of the links I've posted elsewhere here.^[1]

So given all of that background, there are two answers to your original question. The first is to just simply view it as that "reverse-mode autodiff doesn't use dual numbers." Many people have this view, and I would say that that it really focuses on what I've called the "computer science" view above.

The second view is to note that the relationship between reverse-mode autodiff and dual numbers is the same as the relationship between reverse-mode autodiff and forward-mode autodiff. It would be silly to say that they are totally different, unrelated things: at the end of the day all we are really doing is changing the order in which we perform a bunch of multiplies of Jacobian matrices. I tend to view it as similar to the relationship between the DFT and the FFT: there is this super elegant linear algebra view involving DFT matrices. Do we say that the FFT "doesn't use matrices?" Well, I guess, but are we going to go so far as to say that it also doesn't involve linear algebra, etc? That is my view.

There are a few other differences between reverse-mode and forward-mode autodiff. Each individual operation in the computation graph, for instance, can be thought of as an individual instance of forward-mode autodiff. In reverse-mode, on the other hand, we typically store the (back-)propagated adjoints as additional data on the input node objects, not the output nodes. This is useful if we are thinking of backpropagation on a graph. It's up to you if you view these as involving materially different theories or just differences of implementation for the sake of optimization.

In short, the main thing is that there's less literature on a purely algebraic version of reverse-mode autodiff in general.

[1] There is one important difference: we often think of the dual numbers as an ordered ring with "h" infinitesimal. The first viewpoint doesn't use this part of the mathematical structure - and it's very interesting to note that it isn't even necessary! Although it's cute to think of h as infinitesimal, the autodiff properties we get flow purely from the algebraic properties of having h^2 = 0, regardless of order. Of course, though, we can always just think of the dual numbers as an unordered ring, if we want.

compacct27 · on March 22, 2024

Where do I go to learn what you just said?

MikeBattaglia · on March 22, 2024

There are a few good resources. Here is a good blog post with some introduction:

https://towardsdatascience.com/forward-mode-automatic-differ...

Some thorough notes from MIT: https://book.sciml.ai/notes/08-Forward-Mode_Automatic_Differ...

Here is a teacher who taught a class using these, calling the dual element dx instead of h:

https://cornellmath.wordpress.com/2007/08/28/non-nonstandard...

smokel · on March 22, 2024

The book referred to in this post has some information about this. The method with "dual numbers" is called "forward-mode automatic differentiation". PyTorch seems to use "reverse-mode automatic differentiation", which does not use dual numbers, but keeps track of the computation graph.

fpgamlirfanboy · on March 22, 2024

you guys are wrong and spreading blatant misinformation - there is no magic number whose square is 0 but which is itself not zero anywhere in pytorch or tensorflow or any other real DNN framework that i'm familiar with. it's all fun and games to participate in math woo but you shouldn't be proclaiming things you don't actually know on a public forum.

SideQuark · on March 23, 2024

The dual numbers exist just as surely as the real numbers and have been used well over 100 years

https://en.m.wikipedia.org/wiki/Dual_number

Pytorch has had them for many years.

https://pytorch.org/docs/stable/generated/torch.autograd.for...

JAX implements them and uses them exactly as stated in this thread.

https://github.com/google/jax/discussions/10157#discussionco...

Many other frameworks use them also, for many reasons.

As you so eloquently stated, "you shouldn't be proclaiming things you don't actually know on a public forum," and doubly so when your claimed "corrections" are so demonstrably and totally incorrect.

patrick451 · on March 23, 2024

They exist by definition. Your claim makes no more sense than confidently proclaiming that there is no x such that x^2 < 0. We invented the imaginary numbers and made it so. So too the dual numbers.

drdeca · on March 23, 2024

I don't know why you keep calling it "magic". Whether or not pytorch uses them, they aren't magic, neither in the derogatory sense nor in the praise sense.

fpgamlirfanboy · on March 23, 2024

> I don't know why you keep calling it "magic".

because they have all of the gee-whiz factor of a freshman calc proof of the chain rule that divides and multiplies infinitesmals and absolutely not enough of the substance necessary to prove much more than that. they are absolutely, in the research literature, at best an anachronism (harkening back to leibniz) and at worst a parlor trick.

in literally my first response i provided the most trivial counter-example to the magic of non-standard analysis. no answers (crickets). i surmise this is because the people in here talking it up aren't really serious.

drdeca · on March 24, 2024

As I use the terms, the dual numbers are a different thing from non-standard analysis. Non-standard analysis, as I understand the term, uses non-standard models of the real numbers, and its infinitesimals do not satisfy h^2 = 0. In non-standard analysis, f'(x) is the standard part of (f(x+h)-f(x))/h , for an infinitesimal h (i.e. for a non-standard real which is smaller than any non-zero standard rational number). (In order to apply this definition, f should be defined in a way which does not use anything requiring determining if a number is standard, or taking the standard part of something, etc.)

The dual numbers, on the other hand, are the ring R[h]/(h^2) . This is not a field, while non-standard models of the real numbers do form fields.

The dual numbers suffice to define differentiation of polynomials (which may not be sufficient for some purposes! [a]), and something like dual numbers is used in algebraic geometry to define the Zariski tangent spaces for points of algebraic varieties (whether in characteristic 0 or in characteristic p. In characteristic p, one certainly can't use an epsilon-delta definition!).

I really don't see your point about the "gee-whiz factor". While different things, both non-standard analysis and dual numbers can be handled rigorously, and have their use-cases, even though I certainly would at least default to thinking of differentiation in terms of the limits definition (assuming I'm thinking of any specific definition at all).

I assume that the counter-example you refer to is the (dx/dy)(dy/dz)(dz/dx) thing. That indeed doesn't seem like the kind of thing that using the dual numbers would be especially suited for. Though, also, not the sort of thing that should really come up in auto-diff I would think?

If there is a common parameterization of the values of x,y,z by some variable t (on some interval I), on some neighborhood of the point under consideration, where {(x(t),y(t)) | t in I}, {(y(t),z(t)) | t in I}, and {(z(t),x(t)) | t in I}, are each differentiable functions, and where they satisfy the relationships between the variables x,y,z from the larger context, and all three of (dx/dy), (dy/dz), (dz/dx) exist at the point in question, then it seems that the product should be 1.

[a] Though, if the function is analytic (not just smooth), unless I'm missing something, it should also give the right answer (but still not a good definition of course, because should define differentiation before defining what it is for a function to be analytic.) (of course, just because the function is analytic doesn't make its domain include the dual numbers. One has to take a power series for it, and apply this power series to the element of the ring of dual numbers, not apply the original function to it.)

smokel · on March 23, 2024

The grandparent I'm responding to sure uses a very sloppy presentation of things. Not everyone here is a trained mathematician though, so you may want to give people some slack.

Obviously, if h² = 0, then h = 0, so this statement made no sense. What the author probably tried to convey, is that one can reason with infinitely small values as symbols, and perform automatic differentiation with that.

missingET · on March 23, 2024

No, there’s an abstract algebra extension of real numbers to have an extra symbol h such that h^2=0. This is not a real number so you cannot apply the argument h^2=0 implies h=0, much like complex numbers don’t obey all properties of real numbers.

(For example for real numbers, x!=0 implies x^2>0 but i^2=-1)

https://en.m.wikipedia.org/wiki/Grassmann_number

thesz · on March 24, 2024

  a^2 = 1, first base vector is a regular one
  b^2 = -1, second base vector is "imaginary"
  ab = 0, base vectors are orthogonal

  (a+b)^2 = a^2 + 2ab + b^2 = 1 + 2\*0 + (-1) = 0

Trick is taken from conformal geometric algebra [1].

[1] https://en.wikipedia.org/wiki/Conformal_geometric_algebra

clooper · on March 22, 2024

Your are confusing non-standard and dual numbers. The dual numbers are not ordered and contain non-invertible nilpotent elements such as h which squares to 0.

MikeBattaglia · on March 22, 2024

These things are different and I did mean dual numbers. The dual numbers do form an ordered ring. When you complain about certain elements being non-invertible, I think you are probably complaining that they aren't a field.

clooper · on March 22, 2024

If a < b then a^2 < b^2. This is not true if you let a = 0 and b = h. The dual numbers do not have an ordering. You should provide sources and proofs next time because it seems like you are just making things up.

MikeBattaglia · on March 23, 2024

That isn't what an ordered ring is. Your property of a < b → a² < b² doesn't even hold true in the integers. For instance, let a = -2 and b = -1.

The correct property is that if a ≤ b, a + c ≤ b + c, and if a ≥ 0 and b ≥ 0, then ab ≥ 0. It is fairly easy to see that these properties hold for dual numbers.

clooper · on March 23, 2024

In my argument a and b are positive and this is true for all ordered rings but not for the dual numbers as you've defined them. Specify the ordering and you will realize h can not be larger nor smaller than 0 because both cases lead to a contradiction.

In any case, I'm dropping out of this thread.

MikeBattaglia · on March 23, 2024

It isn't true for all ordered rings, and the dual numbers are in fact a counterexample to the claim that it is true.

Beyond that I'm not sure what to tell you, other than it's fairly easy to see that the dual numbers do satisfy the axioms of an ordered ring that I gave. Here's a large survey of various infinitesimal systems by Philip Ehrlich where he also notes the dual numbers are an ordered ring: https://arxiv.org/pdf/1808.03345.pdf.

fpgamlirfanboy · on March 22, 2024

> Most autodiff packages (such as Pytorch) use something not much more advanced than this

pytorch absolutely does not use the dual number formulation - there are absolutely no magic epsilons anywhere in pytorch's (or tensorflow's) code base. what you're calling duals are the adjoints where are indeed stored/cached on every node in pytorch graphs.

there's a reason no one uses dual numbers (non-standard analysis) for anything (neither autodiff nor calculus itself): because manipulating infinitesmals like this is fraught formal manipulation (it's algebra...) where as limits are much more rigorous (bounds, inequalities, convergence, etc.). my favorite question to ask the non-standard analysis n00bs is: please tell me under what conditions this is true

(dx/dy)(dy/dz)(dz/dx) = 1

edit:

anyone that thinks i'm wrong and this other guy is right should go and do some reading, eg where this guy tried to make this same point and got shot down:

https://math.stackexchange.com/a/341550

spoiler alert: there's a reason you had to learn epsilon-delta proofs and limits and it's not because your math professors are mean.

this is why i hate this kind of "TIL, gee whiz" math tidbits - they're full of exclamation marks and fancy sounding words ("non-archimedean rings" oooo fancy) but almost always come from a wikipedia level understanding, not actual research.

MikeBattaglia · on March 22, 2024

At the end of the day, if you are storing inputs and outputs to a function as a pair of numbers - one for the actual value, and one for the derivative - and if addition and multiplication work the way you expect and propagate derivatives correctly - then you are using dual numbers, regardless of if you notate it a + b*h or {"value": a, "derivative": b}.

Pytorch does things slightly differently in that it is mostly focused on reverse-mode autodiff, and so it stores adjoints relative to the overall output rather than partial derivatives relative to the input, but this isn't really an entirely different thing, in the same way that the FFT isn't entirely different from the DFT.

There seems to be some confusion about the relationship between dual numbers and smooth infinitesimal analysis. Both have nilpotent elements, but with dual numbers the background logic is classical, whereas it isn't with smooth infinitesimal analysis.

EDIT: I see you've edited your post to try to get in some extra criticism after I've already responded. That's terrible form, so I'll just respond here.

Dual numbers are a nice way to get started with forward-mode autodiff, to which it is so related that the two are essentially the same thing with different labels. Pytorch instead uses reverse-mode autodiff. Reverse-mode and forward-mode autodiff are different, but not so different that they are entirely different things. Reverse-mode is, as I put it in my OP, "not much more advanced" than forward-mode, even if not identical.

What is entirely different, much more advanced, and what Pytorch really doesn't do, is anything like the "epsilon-delta proofs" you keep hanging your hat on. If Pytorch did that, it would be useless. The entire point of autodiff is to avoid such things.

Beyond that, I would suggest slowing down a bit as you are mixing quite a few things up. Nonstandard analysis has nothing to do with dual numbers at all, for instance. And you're very much misinterpreting that MSE post of mine you linked to (thanks!).

fpgamlirfanboy · on March 22, 2024

> and if addition and multiplication work the way you expect and propagate derivatives correctly - then you are using dual numbers

you literally started out your miraculous comment with

> This new algebra is called the ring of "dual numbers." The difference is that instead of adding a new element "i" with i² = -1, we add one called "h" with h² = 0!

not some observation about caching derivatives.

so i'll repeat myself for the 3rd time: there are no magical numbers anywhere in pytorch or tensorflow or cafe or any other serious autodiff implementation that abide by the rules you so jubilantly exclaim about.

barfbagginus · on March 23, 2024

See my comment to Mike. I think you're making a valid point here, which is that dual numbers by themselves are not powerful enough to automatically generate derivatives of arbitrary functions, especially given that those functions could be implemented in a FPU core, or using methods like lookup tables that don't lend themselves to dual number differentiation.

Dual numbers help us automatically differentiate things when the functions themselves are implemented as analytic power series that we have to explicitly compute without accelerator help. In such cases we can indeed use them. But to your point, serious forward AD engines need to differentiate functions that are computed in one shot by accelerator hardware.

However Mike makes a very valid counterpoint when he shows forward mode AD in Torch. I believe a careful analysis of Torch's implementation here could bring this conversation to a productive and satisfying conclusion for all participants and our public audience.

My big question here is to what degree did the implementers try to respect the dual number approach? Did they implement a dual tensor class for instance? Do they automatically lift some ordinary computations into dual tensor computations? I honestly have my doubts there.

I have confidence that we can get to the bottom of this. I think that Mike actually does care about automatic differentiation, and would be receptive to discussing this point of subtlety that naive dual number implementations may not be enough for industrial strength AD systems, with clear examples of code and clear reasoning as to how dual numbers fail in important cases.

MikeBattaglia · on March 22, 2024

Thank you for repeating yourself three times. It seems like you think that the dual number algebra involves "magic woo numbers." It seems like you haven't really worked through this stuff too much. I would suggest reading some of the resources above, such as the MIT lecture series. The rest of your points I think I have already addressed, though you ignored in your reply - I've said Pytorch does reverse mode diff several times at this point.

fpgamlirfanboy · on March 22, 2024

> It seems like you haven't really worked through this stuff too much

yup not at all - i just wandered in off the street and knew accidentally that you were talking about non-standard analysis.

> The rest of your points I think I have already addressed

please show me the source line number in pytorch or tensorflow that defines this number

> we add one called "h" with h² = 0!

MikeBattaglia · on March 23, 2024

Sure, right here: https://github.com/pytorch/pytorch/blob/main/torch/autograd/...

Here's the documentation: https://pytorch.org/tutorials/intermediate/forward_ad_usage....

> When an input, which we call “primal”, is associated with a “direction” tensor, which we call “tangent”, the resultant new tensor object is called a “dual tensor” for its connection to dual numbers[0].

barfbagginus · on March 23, 2024

This could help settle the objection that torch doesn't implement dual number based Forward Accumulation.

But I'm wondering if it does it by implementing dual tensors and automatically 'lifting' ordinary tensor computations into dual tensor computations? That would be a little surprising to me.

The more common approach I have seen is that we decorate existing operations with additional logic to accumulate and pass on a derivative value as well as the actual value during evaluation. This can be important for instance for transcendental functions, which might be computed with methods like lookup tables and approximate series, which do not necessarily lend themselves to accurate dual number computations, but do have a straightforward formulas for the derivative. It can also be a requirement when our transcendentals are computed in the FPU, which does not expose any power series to automatically thread dual our numbers through.

It would make sense in the case of something like pytorch if this were the case, since it could be a bit of a stretch to expect the correct numbers to appear if only we just compute everything with dual numbers. Indeed, the original torch functions certainly exploit the FPU, so we very likely have to explicitly formulate a derivative in at least some cases.

I wonder if this observation could help heal the rift between the two positions here - it seems like your counterpart could be satisfied with the view that most forward mode AD it's not quite as "pat" as just injecting a dual numbers library into existing code, but requires careful extension to accurately accumulate the derivatives of each operation in the system.

I believe that reaching common ground around that fact could help your counterpart reach a satisfying conclusion here. The methods are clearly dual number in spirit, but may require more subtle implementation details then the traditional dual number story, which states "dual numbers get you free derivatives with no need to extend your functions". Pointing out how this does break down for general functions is not only true, but could serve as an olive branch and opportunity to advance this discussion.

Besides that remark, which I made in the intent of resolving a conflict and potentially fostering communication, I wanted to thank you for this amazing link that you shared. I did not know that pytorch had forward mode AD! I may just have to dig into it and see how they pull it off!

samatman · on March 23, 2024

You seem somewhat obsessed with the idea that reverse-mode autodiff is not the same technique as forward-mode autodiff. It makes you,,, angry? Seems like such a trivial thing to act a complete fool over.

What's up with that?

Anyway, here's a forward differentiation package with a file that might interest you

https://github.com/JuliaDiff/ForwardDiff.jl/blob/master/src/...

patrick451 · on March 23, 2024

I can't reply to the guy saying julia is the only one. But there are others.

Ceres uses dual numbers

https://github.com/ceres-solver/ceres-solver/blob/master/inc...

This library from google is used everywhere in robotics, so it's hardly some backwater little side project.

So does c++ autodiff https://github.com/autodiff/autodiff/blob/main/autodiff/forw...

So does Eigen: https://eigen.tuxfamily.org/dox/unsupported/AutoDiffScalar_8...

MikeBattaglia · on March 23, 2024

So does Pytorch: https://pytorch.org/tutorials/intermediate/forward_ad_usage....

fpgamlirfanboy · on March 23, 2024

it's amazing to me that pointing out a straight up mathematically factual inaccuracy is considered "angry" and "acting a fool".

samatman · on March 23, 2024

> there's a reason no one uses dual numbers (non-standard analysis) for anything (neither autodiff nor calculus itself)

wrong

> there are no [dual] numbers anywhere in [...] any other serious autodiff implementation

wrong

> please show me the source line number

did

> i'm wrong and this other guy is right

correct.

> this is why i hate this kind of "TIL, gee whiz" math tidbits

acting a fool. angrily so.

fpgamlirfanboy · on March 23, 2024

is this like some kind of version of truman show? the full sentence is

>please show me the source line number in pytorch or tensorflow that defines this number

why? because the original comment makes a claim about pytorch. it's all right there in black and white.

samatman · on March 23, 2024

Which claim, this one?

> Most autodiff packages (such as Pytorch) use something not much more advanced than this, although there are optimizations to speed it up (e.g. reverse mode diff).

We get it: you're a mechanic who thinks that makes you an automotive engineer.

fpgamlirfanboy · on March 23, 2024

i haven't a clue what that means but when julia's autodiff is the only one that is implemented using dual numbers i think the claim is pretty obviously false and seemingly everyone has an issue admitting that. without a doubt you'll just say something else sarcastic rather than admit it.

SideQuark · on March 23, 2024

You've filled this page with comments on non-standard analysis, but the dual numbers have precisely zero to do with it. Calling people n00bs on a topic you apparently do no understand is silly.

Non-standard analysis deals with fields only, and the dual numbers are not a field, there are infinitely many zero divisors.

You should read the wiki pages on both, then maybe this mathoverflow post explaining it. The clearest way to maybe grasp the difference for you is that in any formulation of non-standard analysis, the square of any infinitesimal is another infinitesimal, and never 0. In the dual numbers, the square of any infinitesimal is always precisely, exactly zero.

They are so fundamentally different that anyone (like you) that claims to be so cognizant of either would never repeat they are the same as loudly and frequently as you are.

https://math.stackexchange.com/questions/341535/is-the-theor...

bollu · on March 22, 2024

You're right on one count: no library implements forward mode. Hence, you're correct that no autodiff library (including pytorch) implement autodiff this way.

However, *you're wrong* that forward mode cannot be written in terms of dual numbers. The point is that the addition and multiplication operation for dual numbers correspond exactly to rules of the derivative of addition and the derivative of the product.

fpgamlirfanboy · on March 22, 2024

i didn't say it cannot be, i said there's a natural reason it's not: the same reason the rest of analysis isn't.

> You're right on one count: no library implements forward mode

there are plenty of CFD type libraries that implement forward mode. they also do not use what i'm calling "magical" dual numbers i.e. these nilpotent ring elements.

noqc · on March 23, 2024

Nilpotents aren't exactly new.