The most significant bits of a floating point representation are basically a logarithm of the number. Logarithms have this relation between multiplication and addition.
As the very first article says:
> You can multiply two numbers by adding their exponents. So just with the exponent-addition you will be within a factor of 2 of the right result. But this will actually be much better and get within 7.5% of the right answer. Why?
You can multiply two numbers by multiplying their mantissas and adding their exponents.
Since the numbers are stored in base 2, the mantissa values range from 1 to 2 (subnormals notwithstanding) - but they're effectively stored as a 0..1 value. The ranges produced by adding vs. multiplying them have considerable overlap, and on average the difference won't be much. Because of how the mantissa and exponent fields are arranged (a deliberate choice by the standard, AFAIK), the mantissa effectively adds additional bits to the approximation of the logarithm given by the exponent.
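In code, the whole trick is one line: reinterpret the bits as integers, add, and subtract the bias once. A minimal Rust sketch, assuming positive, finite, non-subnormal inputs whose product stays in normal range (0x3F80_0000 is the bit pattern of 1.0f32, i.e. the bias 127 shifted into the exponent field):

    fn approx_mul(a: f32, b: f32) -> f32 {
        // bits(x) ≈ 2^23 × (log2(x) + 127), so adding bit patterns adds the
        // logarithms; subtract the bias once so it isn't counted twice.
        f32::from_bits(a.to_bits() + b.to_bits() - 0x3F80_0000)
    }

For example, approx_mul(3.7, 2.9) gives 10.4 against the exact 10.73, well within the ~7.5% bound.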
Edit: this other comment chain https://news.ycombinator.com/item?id=43034230 perhaps puts it better.
To give a "yes and" side-track to your comment: saying "logarithms have this relation between multiplication and addition" is even underselling what logarithms are, because reducing multiplication to an additive operation was the whole motivation for John Napier[0] to discover/invent logarithms:
> “…nothing is more tedious, fellow mathematicians, in the practice of the mathematical arts, than the great delays suffered in the tedium of lengthy multiplications and divisions, the finding of ratios, and in the extraction of square and cube roots… [with] the many slippery errors that can arise…I have found an amazing way of shortening the proceedings [in which]… all the numbers associated with the multiplications, and divisions of numbers, and with the long arduous tasks of extracting square and cube roots are themselves rejected from the work, and in their place other numbers are substituted, which perform the tasks of these rejected by means of addition, subtraction, and division by two or three only.”[1]
Logarithms were honestly an enormous breakthrough in optimization, computers wouldn't be remotely as useful without them, even if most of us don't "see" the logarithms being used.
In fact I'd argue that they are the second-biggest computational optimization in use today, with only positional notation being a bigger deal. Which, funny enough, works kind of similarly: imagine you could only count by tallying (so, unary). Adding two numbers M and N would take M+N operations, e.g. 1234 + 5678 would require counting all 6912 individual marks. Unary math scales O(n) in both data and computation. Systems like Roman numerals almost work, but as soon as we reach values larger than the largest symbol (M for 1000) it's O(n) again, just with a better constant factor.
With positional notation, numbers require only log(n) symbols to write down, and log(n) operations for addition: e.g. 1234 + 5678 requires one or two additions for each digit pair in a given position - one addition if there's no carry from the previous position, two if there is. So addition takes at most 2 × ceil( max( log(M), log(N) ) ) operations, i.e. O(log n).
Logarithms take that idea and "recursively" apply it to the notation, making the same optimization work for multiplication. Without it, the naive algorithm for the multiplication of two numbers requires iterating over each digit, e.g. 1234 × 5678 requires multiplying each of the four digits of the first number with each of the digits of the second number, and then adding all the resulting numbers. It scales O(di×dj), where di and dj are the number of digits of each number. If they're the same we can simplify that to O(d²). When the numbers are represented as two logarithms the operation is reduced to adding two numbers again, so O(log(d) + [whatever the log/inverse log conversion cost is]). Of course d is a different value here and the number of digits used affects the precision.
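In code the reduction is a one-liner; a sketch (the precision caveat applies here too, since f64 logarithms are themselves finite-precision):

    fn main() {
        // Multiply by adding logarithms, then invert the transform.
        let (a, b) = (1234.0_f64, 5678.0_f64);
        let product = (a.ln() + b.ln()).exp();
        println!("{product}"); // ≈ 7006652 = 1234 × 5678
    }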
I think the craziest thing of all this is that we're so used to positional notation that nobody ever seems to consider it a data compression technique. Even though almost no other data compression method would work without it as a building block (run-length encoding, Lempel-Ziv, Arithmetic coding? Useless without positional notation's O(log(n)) scaling factor). The only exceptions are data compression methods that are based on inventing their own numerical notation[2].
We do this every day ever since we first learned addition and subtraction as kids. Or as David Bessis[3] puts it in his book "Mathematica": ask almost any adult what one billion minus one is and they know the answer instantaneously, so most adults would appear to have mental superpowers in the eyes of pretty much all mathematicians before positional notation was invented (well, everyone except Archimedes maybe[4]). Positional notation is magical, we're all math wizards, and it's so normalized that we don't even realize it.
But to get back to your original point: yes, you are entirely correct. IEEE floats are a form of lossy compression of fractions, and the basis of that lossy compression is logarithmic notation (but with a fixed number of binary digits and some curious rules for encoding other values like NaN and infinity).
[0] https://en.wikipedia.org/wiki/John_Napier
[1] https://en.wikipedia.org/wiki/Mirifici_Logarithmorum_Canonis...
[2] https://en.wikipedia.org/wiki/Asymmetric_numeral_systems
[3] https://www.quantamagazine.org/mathematical-thinking-isnt-wh...
Excellent comment!
As a historical tidbit I'll add that Romans did develop two ways to write larger numbers.
1. Writing a line (vinculum) over a numeral to multiply its value by 1,000. This was in fact extended to writing a line to the left and above a numeral to multiply its value by 1,000,000, and could in principle be extended to lines below and to the right to multiply by 10^9 and 10^12, and even nested boxes for larger powers.
2. The use of |), |)), |))), ... for 500, 5,000, 50,000, ... and (|), ((|)), (((|))), ... for 1,000, 10,000, 100,000, ... These can be continued indefinitely.
https://en.wikipedia.org/wiki/Roman_numerals#Large_numbers
Both require an ever increasing number of marks just to write the increasing powers, as well as an ever increasing number of powers being summed, but both increase only logarithmically, so we end up using O((log n)²) marks to write n. This is quadratically worse than positional notation, but exponentially better than just writing M over and over.
Wow, that's amazing! I had somehow never heard of this before despite being into numeral systems. Thank you for sharing!
If you like numeral systems, you might be interested in packages that let one do arithmetic using continued fractions: https://ieeexplore.ieee.org/document/6158099
Note that our use of D and M for the "Roman numerals" 500 and 1000 is a misinterpretation of their |) and (|).
Can you really argue that this use is a "misinterpretation" and not just the usual evolution of written symbols?
Yes, D and M are letters.
Symbols like that can represent all sorts of things, and numbers are no exception! In fact the Romans had been using other letters to represent their numbers since Etruscan times. Wait until you hear the torrid history of the character V (and its "misrepresentation" U).
And if I may take this yet a step further, this is the point of mathematical transformations: to find a domain in which desirable operations are easier, such that the cost of the round trip to and from that domain is offset.
In the past, transformations like logarithms, Fourier transforms, and wavelets had to be proposed by hand. Enabled by advances in computers, machine learning automated all that away by composing differentiable building blocks into universal estimators. The parameters of these building blocks are estimated through gradient descent in conjunction with a user-chosen loss function, which guides the optimization of the transform. Good representations can be manipulated through basic algebra (added, averaged, compared for similarity, depending on the task) in a way that corresponds to semantic operations, where their raw, untransformed representations cannot.
I'm still amazed at word embeddings. They let you find related words, and even do addition and subtraction.
The way you can supposedly take "king", subtract "man" and add "woman" and you get "queen", and this kind of thing is the mathematical basis of LLMs.
That's an urban legend and not how embeddings work.
I'm actually genuinely interested to know what you're referring to, because as an I.T. professional who's doing their best to keep up with how things work, that's more or less how I understood it as well. When words are mapped to a continuous vector space, placing semantically related words near one another gives their vector coordinates similar values (see the excerpt from IBM below).
However, I really don't understand how that necessarily enables one to perform arithmetic with two sets of vector coordinates and expect the result to be semantically related to the original two words. I understand how using a model to create embeddings with semantically correlated values can be achieved, and why that would be so fundamental for LLMs. My math skills aren't advanced enough to confidently land on either side of this question, but my instincts are that such an elegant relationship would be unlikely. Then again, mathematics is replete with counterintuitive but elegantly clever connections, so I could absolutely understand why this is eminently believable--especially in the context of AI and language models.
According to the descriptions from IBM [0]:
> Word embeddings capture the semantic relationships and contextual meanings of words based on their usage patterns in a given language corpus. Each word is represented as a fixed-sized dense vector of real numbers. It is the opposite of a sparse vector, such as one-hot encoding, which has many zero entries.
> The use of word embedding has significantly improved the performance of natural language processing (NLP) models by providing a more meaningful and efficient representation of words. These embeddings enable machines to understand and process language in a way that captures semantic nuances and contextual relationships, making them valuable for a wide range of applications, including sentiment analysis, machine translation and information retrieval.
> Popular word embedding models include Word2Vec, GloVe (Global Vectors for Word Representation), FastText and embeddings derived from transformer-based models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer).
---
0. What is embedding? (https://www.ibm.com/think/topics/embedding)
It's my understanding of how embeddings work, as well. The dot products (cosine similarity) of man . woman end up being very similar to king . queen. This similarity could be considered subtraction if you stretch the metaphor enough.
If this is not a valid way to think about embedding vectors, would you care to elaborate?
Embedding vectors are what you get when you take a high-dimensional vector of word IDs and reduce its dimensionality to something more manageable. (Think something like principal component analysis or singular value decomposition.)
The "similarity" here means word IDs that commonly occur together in vectors.
There is no logical reasoning or attempts at semantic analysis here.
> The dot products (cosine similarity) of man . woman end up being very similar to king . queen
This is because 'king' and 'man' occur together in a distribution similar to that of 'queen' and 'woman'.
The idea that the embedding of 'king' is somehow a sum of 'autarch' and 'man' and that subtracting 'man' from 'king' and adding 'woman' somehow gives you 'queen' is an urban legend. Embeddings don't carry semantic meanings, they aren't dictionaries or encyclopedias. They are only statistical features about word co-occurrences.
This blog post [0] thinks that it is indeed possible to do arithmetic operations on them. My intuition is that they're vectors after all, and can be added and subtracted like any other vector. A word is just a location in that high-dimensional vector space.
[0] https://blog.dataiku.com/arithmetic-properties-of-word-embed...
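Mechanically, the arithmetic really is that simple once words are vectors. A toy Rust sketch with hand-picked 3-d vectors (real embeddings have hundreds of learned dimensions; these numbers are made up purely to illustrate the comparison, not taken from any model):

    fn cosine(a: &[f64], b: &[f64]) -> f64 {
        let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let norm = |v: &[f64]| v.iter().map(|x| x * x).sum::<f64>().sqrt();
        dot / (norm(a) * norm(b))
    }

    fn main() {
        // Hand-picked toy vectors, NOT real embeddings.
        let king  = [0.9, 0.8, 0.1];
        let man   = [0.1, 0.9, 0.0];
        let woman = [0.1, 0.1, 0.9];
        let queen = [0.9, 0.05, 0.95];
        // king - man + woman, component-wise.
        let target: Vec<f64> = (0..3).map(|i| king[i] - man[i] + woman[i]).collect();
        println!("{:.3}", cosine(&target, &queen)); // ≈ 0.999 for these toy values
    }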
EDIT: I guess there are different forms of word embeddings, and apparently modern LLMs don't use static word embeddings like word2vec; it's more contextual. Tokens aren't 1:1 with words either, of course. I guess it's more complex than "LLMs represent words as vectors". Still, it's a neat trick, and it is indeed that simple with something like word2vec.
Edit 2: some interesting slides including the bit about semantic analogy and a paraphrase from the original word2vec paper about the king queen thing: https://staff.fnwi.uva.nl/e.kanoulas/wp-content/uploads/Lect...
And the original word2vec paper by Mikolov mentions it https://arxiv.org/abs/1301.3781
> The idea that the embedding of 'king' is somehow a sum of 'autarch' and 'man' and that subtracting 'man' from 'king' and adding 'woman' somehow gives you 'queen' is an urban legend.
(Shrug) Take it up with Bishop, page 376: https://i.imgur.com/PgjQK3t.png
And I will, that passage is intentionally written to mislead in a self-serving way. If you know the math behind it you understand that it's technically correct, but a layman or a cursory reading will leave you with a wildly incorrect idea.
It's above my pay grade for sure, but you can get in touch with him at either Microsoft Research, the Royal Academy of Engineering, or the Royal Society, where he holds fellowships. Might want to cc: Yoshua Bengio, who seems to be laboring under similar misapprehensions.
I remember the glee I felt when I first used logarithms to represent large numbers in Excel. Suddenly, I could easily approximate large factorials, and they were fast! I wondered how many other people have discovered this trick. I couldn't find many references on the Internet. Over time, I realized this method of handling large numbers is used so often that most people who know about it don't bother to talk about it.
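For anyone curious, the trick is to keep log10(n!) instead of n! itself. A sketch of the idea (in Rust rather than Excel, but it's the same method):

    // log10(n!) = log10(1) + log10(2) + ... + log10(n): the sum never overflows.
    fn log10_factorial(n: u64) -> f64 {
        (1..=n).map(|k| (k as f64).log10()).sum()
    }

    fn main() {
        let lg = log10_factorial(1000);        // ≈ 2567.6046
        let mantissa = 10f64.powf(lg.fract()); // ≈ 4.0239
        println!("1000! ≈ {:.4}e{}", mantissa, lg.trunc()); // 1000! ≈ 4.0239e2567
    }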
This should have been part of the HS curriculum.
Logarithms pop up (IIRC) in 10th grade, and ln(x*y) = ln(x) + ln(y) is usually explained as part of that. What many teachers completely fail to teach is that it's not just a formula to memorize, but that it has profound implications on how you can do math. As you discovered by yourself. (Kudos!)
It is, with the right teacher, a super-exciting story with lots of colorful history & characters. You can even guide your students to come to that crucial insight themselves. Alas, few teachers have the necessary amount of passion to make that come alive.
But that's probably why few people write about it - for most, either their math interest got murdered in HS so they never get to look at it, or they did learn it in 10th grade and so consider it not worth mentioning.
(Corollary - we all should write more about the cool realizations we had, even if they're in retrospect "well known")
I must admit I wanted to reply that all of that is true, but that calculating logarithms is difficult. Then I looked it up and found that the slide rule was invented shortly after logarithms were.
> When the numbers are represented as two logarithms the operation is reduced to adding two numbers again, so O(log(d) + [whatever the log/inverse log conversion cost is]).
This is pretty confused. First, if d is the number of digits of a number, then you will need O(d) memory to store its logarithm, otherwise you will lose a lot of precision. Thus, the addition of logarithms is still O(d), not O(log(d)).
Second, calculating log(x) and its inverse, exp(x) is much more expensive than multiplication. For example, if x is between 0 and 1, you can approximate exp(x) pretty well as 1 + x + x^2/2 + x^3/6 + x^4/24. This is 3 multiplications and 3 divisions just to convert.
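Spelled out, counting the operations the way the comment does (a sketch):

    // 1 + x + x^2/2 + x^3/6 + x^4/24, for x in [0, 1]
    fn exp_approx(x: f64) -> f64 {
        let x2 = x * x;  // multiplication 1
        let x3 = x2 * x; // multiplication 2
        let x4 = x3 * x; // multiplication 3
        1.0 + x + x2 / 2.0 + x3 / 6.0 + x4 / 24.0 // 3 divisions
    }

At x = 1 this gives 2.7083 against e ≈ 2.7183, already within 0.4%.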
> Thus, the addition of logarithms is still O(d), not O(log(d)).
Oh dear, yes that's wrong, thank you for catching that.
Still, O(d) is a lot better than O(d²). And perhaps more importantly: division (which is even more painful to compute) becomes a simple matter of subtracting the logarithms and then exponentiating the result.
> Second, calculating log(x) and its inverse, exp(x) is much more expensive than multiplication.
You're not wrong, but you're talking about the costs of a computer doing the calculation. I was talking about a human doing calculations by hand. What I forgot to mention is that Mirifici Logarithmorum Canonis Descriptio contained 90 pages of precomputed tables. Multiplying two numbers is then a matter of two look-ups, one addition, and another look-up. For large enough numbers that's worth it (and "large enough" is reached pretty quickly in the case of calculations done by hand).
And shortly after the introduction of logarithms, William Oughtred invented the slide rule, using two log scales to create an analog computer, removing the need for a table as well (depending on the precision required).
Yeah, if we are talking about humans, then this is definitely a big win: a table lookup is faster for a human than multiplication (not really true for computers), and slide rules make all the difference.
Wish I could upvote this more than once :-). It's also why people had books of 'logs' back in the day that were just tables of the log10 value of each number.
John Napier is my current math nerd hero! I've been trying to promote "magnitude notation"[0] as the next evolution in numeric notation. If we used mag numbers in science and journalism and even finance, we'd get the same increase in mental superpowers as we got with positional notation.
What a nice idea!
Notation making it easier to intuit logarithmic scales does sound like a good idea in a world where human societies partly struggle to come to action because of things like the majority of people not intuitively understanding exponential growth, or the difference in scale between "millionaire" and "billionaire". It would be a good tool for the mind.
> In fact I'd argue that they are the second-biggest computational optimization in use today, with only positional notation being a bigger deal
For the sake of argument, I think representing numbers in binary is the most fundamental optimization in use today.
You're arguing that positional notation using a specific base is more fundamental than the concept of positional notation itself?
Although I agree that base 2 is a great pick, of course, for many reasons. So perhaps we should put binary in second place, and logarithms in third.
yeah. because positional notation is a general mathematical concept, and it's only in practical application that its power is realized. binary is the most efficient base for computation with the technology we have at hand. positional notation exists for any/every base, but binary is the foundational optimization in practice.
if I were to argue it, that is. In reality they're all really important and it's not like there's a twisted math genie that's forcing us to choose one or the other. I felt like pointing out that choosing base two is also a very important, fundamental optimization decision because it's led us to the world we have today.
Yes, and overflowing the mantissa moves the exponent in the right direction by the right amount; it's just that the least significant bits are not quite in the right place anymore. But that's by definition a small error.
Anyone who did high school before calculators (late 70s) would have carried around a book of logarithmic tables so they could do the multiplication necessary for many problems.
This is the core trick in the Fast Inverse Square Root[0], as made famous by the Quake III source code.
It uses a shift (equivalent to dividing) and a subtraction in integer-land to estimate x^(-0.5) in float-land.
[0]: https://en.m.wikipedia.org/wiki/Fast_inverse_square_root
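A Rust transcription of the routine (hedged: the original is C; the magic constant is the one from the Quake III source):

    fn fast_inv_sqrt(x: f32) -> f32 {
        // The float's bits, read as an integer, approximate a scaled log2(x).
        let i = x.to_bits();
        // Shifting halves that "logarithm"; subtracting from the magic
        // constant negates it and restores the exponent bias, giving a first
        // estimate of x^(-0.5).
        let y = f32::from_bits(0x5F37_59DF - (i >> 1));
        // One Newton-Raphson iteration sharpens the estimate.
        y * (1.5 - 0.5 * x * y * y)
    }

For example, fast_inv_sqrt(4.0) returns about 0.49915 against the exact 0.5.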
I've always hated that it's called an “inverse”. It's not an inverse. The inverse of square root is square. If they had called it “reciprocal”, it would have been clear to me what it does, but “inverse” confused the hell out of me when I first saw it.
It’s confusing for non-mathematicians, but (and you may know that) it is not incorrect. https://en.wikipedia.org/wiki/Inverse_element:
“In mathematics, the concept of an inverse element generalises the concepts of opposite (−x) and reciprocal (1/x) of numbers.”
i think this is just typical natural language ambiguity:
"Inverse Square root" is used to mean "(Multiplicative) Inverse (of the) Square root" as opposed to "(Square root) Inverse"
No, it's confusing for mathematicians because it is directly opposing math terminology. (Which is okay, but confusing.)
It doesn't though; strictly speaking, "inverse" means the multiplicative inverse. The idea was eventually extended to inverse functions, which for the sake of brevity we also call the inverse. Now f^{-1}(x) and f(x)^{-1} are clearly different beasts: the first would describe the square (for square root) while the second describes what Quake was doing. Whether you call something an inverse or its full name (multiplicative/functional inverse) depends on whether it's unclear from context which was meant.
the function inverse is the multiplicative inverse in the group of automorphisms over sets (when the multiplication operation is functional composition).
I think it's clear we both know that, but for the sake of the commenter arguing the two things are different, it does not help to simply say they're the same without giving an example, no?
you misunderstand my intent, sorry! i was boosting your point, not arguing against it.
please insert the word "yep," in front of my parent comment
Found the Haskell guy.
im so anti-haskell, your comment pained me to hear
Right, the problem is that it's ambiguous, not that it's incorrect. Inverse square root could also mean -(x^0.5) if it meant the additive inverse, and it could mean x^2 if it meant the functional inverse, as said here.
Since calling the square an inverse square root makes zero sense in any practical context, the other common meaning being applied is obvious. In this case, it's not the inverse of a _function_, it's the inverse of a _number_, which is usually considered to be the same as the reciprocal.
as a mathematician: it's fine. the word "multiplicative" is just silent in the phrase.
Interestingly Intel got it right, thus the rsqrt family of instructions.
The inverse is about the negative power right? Square root is the 0.5
"inverse" has a very specific meaning: inverse(x) * x = 1
x^2 * x != 1 for any x other than 1. So no, x^2 is not the inverse of sqrt(x)
No. "inverse" has a very specific meaning: the operation that undoes whatever operation you're talking about.
https://en.m.wikipedia.org/wiki/Inverse_function
The inverse of multiplying by 3 is dividing by 3, and vice-versa.
The inverse of squaring is square root, and vice-versa.
The inverse of a number or other element (rather than an operation / function) means whatever number would form the inverse if you use it with whatever binary operation you're talking about. So the additive inverse of 3 is –3, the inverse of "rotate clockwise 90 degrees" in the group of geometric operations is "rotate anticlockwise 90 degrees", and the multiplicative inverse of 3 is 1/3.
Saying "the inverse of 3" to mean 1/3, i.e. its multiplicative inverse, is sloppy but acceptable so long as everyone knows what you mean (and it's fairly common). Saying "the inverse square root" to mean 1/square root is just wrong.
Is it acceptable to use the imperative? That is, to say: “Now, invert the element”. Or should it always be, “take the inverse of …”?
But sqrt · square = 1, where "sqrt: R⁺ → R⁺" is the square root operation, "square: R⁺ → R⁺" is the squaring operation, "1: R⁺ → R⁺" is the identity operation, and (·) is function composition: i.e., working in the monoid of functions R⁺ → R⁺. So x² and sqrt(x), as elements of R, are not inverses, but sqrt and square, as elements of R⁺ → R⁺, are.
It depends on how you parse the phrase "fast inverse square root".
This is true, but interpreting it one way is clearly absurd (nobody would call a square the “inverse square root”), which implicates that the other meaning was intended.
Neat. Of course it works for the exponent, but that it's not that far off for the mantissa is unexpected. It helps that the mantissa is normalized.
For the mantissa, the insight is probably that:
(1+e1)*(1+e2) = 1+e1+e2+(e1*e2)
If e1 and e2 are small, then e1*e2 is negligible.
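For example, with e1 = e2 = 0.1 the exact product of the mantissas is 1.21, while adding gives 1.2; the dropped e1×e2 term (0.01 here) is the entire error.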
Huh, and in the case that e1 and e2 are large, the mantissa overflows into the exponent, multiplying the result by 2.
I guess we're lucky that the IEEE 754 committee put the exponent in front of the mantissa. Or maybe they were aware of this trick all along ...
This arrangement makes ordering easier anyway which is probably why they chose it.
If we have some positive 32-bit integers A and B then if A < B, f32::from_bits(A) < f32::from_bits(B)
Edited to add anecdote:
I actually had a bug in realistic (my Rust crate to implement Hans Boehm's "Towards an API for the Real Numbers") which I only fixed recently, in the conversion from my Real number type to a floating-point type: I could end up with too many mantissa bits, and when the bottom bit of the exponent happened to be zero, overwriting that bit with a one coincidentally gave the same result as correctly increasing the exponent by one, masking the bug.
I finally caught that when it occurred to me to do "thorough testing". That is, take all possible 32-bit floats, turn them into a Real, then, turn that Real back into a 32-bit float, this should be a roundtrip, if it's not we've found a bug. There are "only" a few billion of these values to test, a machine can do that.
I fixed the 64-bit routines for the same bugs, but I don't have enough years to run all the 64-bit tests the same way and discover any that aren't paralleled.
[Of course there are presumably still bugs in the conversion algorithms which may either be symmetric, and thus the bug doesn't impact a round trip, or they don't affect the binary fractions used exclusively by floats, Real has proper rationals and various other things like the square root of integers, Pi, etc. which we can convert to a float and lose precision but we never get these from converting a float because floats are always binary fractions.]
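For the curious, the exhaustive sweep is only a few lines. Read this as a sketch: Real::from and the conversion back to f32 stand in for the crate's actual API, whose signatures I'm not asserting here:

    // Sweep every f32 bit pattern; skip NaNs, which can't round-trip by value.
    for bits in 0..=u32::MAX {
        let x = f32::from_bits(bits);
        if x.is_nan() { continue; }
        let back: f32 = Real::from(x).into(); // hypothetical conversion API
        assert_eq!(x.to_bits(), back.to_bits()); // compare bits so -0.0 vs 0.0 counts
    }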
> I fixed the 64-bit routines for the same bugs, but I don't have enough years to run all the 64-bit tests the same way and discover any that aren't paralleled.
Minor thought for today: there are more than 2^64 bits of RAM in the world.
> If we have some positive 32-bit integers A and B then if A < B, f32::from_bits(A) < f32::from_bits(B)
That is, if both are valid floats that represent a finite non-zero number.
Correct.
Floats support both -0 and 0; they compare equal, but their bit representations don't. Zero can be tricky.
Not-a-number values (NaNs) always compare as false, even when their bit representations are identical.
> realistic (my Rust crate to implement Hans Boehm's "Towards an API for the Real Numbers")
So cool, I didn't know anyone was working on this!
If you're familiar with that work, realistic implements most of it (some trig functions I didn't get to yet) but it's deliberately not 1.0 because I'm still tinkering with core API things like, I changed how the errors work this week.
If you're not familiar, a fact about the reals which cannot be emphasised enough is that almost all real numbers are non-computable. This serves to properly temper expectations, regardless of whether you know what the mathematical term "Almost all" means. We're going to be able to compute some useful reals, and then do arithmetic with them, and not get miserably bad rounding errors like for floating point for example. We just need to keep in mind that "towards" here is just gesturing in a direction, it's not a destination which can be reached.
Haha I'm familiar, I actually also tried to implement the paper at some point. It's such a cool idea and a great example of the difficulty that goes into making a really good UX, even for something as simple-sounding as a calculator app.
What's obvious to Kahan is being relearned by the rest of us.
Previous discussion of the paper:
Addition is all you need for energy-efficient language models - https://news.ycombinator.com/item?id=41784591 - Oct 2024 - 126 comments
This can be very worth it in circuit design for custom accelerators. Floating-point operations are often multicycle. If you can get close enough with addition it will save a ton of resources and probably also simplify the design.
It's math and not just for integers. That's also how a slide rule works.
a×b = 10^(log(a×b))
log(a×b) = log(a)+log(b)
thus a×b = 10^(log(a)+log(b))
Yes, but what is nontrivial here is how good the approximation is, despite the mantissa part being linear (much better than the upper bound of being within 2x of the real value).
Nontrivial yes, but it's not that complicated to figure out that floats are a piecewise linear approximation of the base 2 logarithm.
Edit: or base 2^k, depends a bit how you view it.
The very first paragraph of the article says this and then goes on to say the trick gets the mantissa roughly right too:
> You can multiply two numbers by adding their exponents. So just with the exponent-addition you will be within a factor of 2 of the right result. But this will actually be much better and get within 7.5% of the right answer. Why?
> The paper talks about using this to save power, but that’s probably not worth it for a few reasons
It's probably not worth it to do this in software. But in hardware, it might be!
A similar approach with a hardware prototype. https://research.nvidia.com/publication/2022-12_lns-madam-lo...
I wonder if there was any inspiration from Quake III.
Didn’t the inverse square root trick rely on bit-level floating point and subtraction of a bias?
float32 (1+m)×2^e as int32 = e<<23 | m (ignoring the exponent bias, with m the 23-bit fraction field)
(1+m1)×2^e1 × (1+m2)×2^e2 = (1+m1+m2+m1×m2)×2^(e1+e2)
If m1×m2 is small, that's approximately float32(m1+m2, e1+e2).
Would subtraction also approximate division?
Yes
Would negation also approximate the reciprocal?
Yes. Logarithms are wonderful like that :)
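Concretely, under the same assumptions as the multiplication trick (positive, finite, non-subnormal operands), both fall out of the same bit arithmetic; 0x3F80_0000 is the bit pattern of 1.0f32 and 0x7F00_0000 is twice the bias. A sketch:

    fn approx_div(a: f32, b: f32) -> f32 {
        // Subtracting bit patterns subtracts the logarithms; the bias cancels,
        // so add it back once. Wrapping ops avoid debug-mode overflow panics;
        // the result is only meaningful when the true quotient is a normal float.
        f32::from_bits(a.to_bits().wrapping_sub(b.to_bits()).wrapping_add(0x3F80_0000))
    }

    fn approx_recip(x: f32) -> f32 {
        // Negating the logarithm: twice the bias minus the bits.
        f32::from_bits(0x7F00_0000u32.wrapping_sub(x.to_bits()))
    }

For example, approx_div(10.0, 2.0) and approx_recip(2.0) come out exactly 5.0 and 0.5, since powers of two carry no mantissa error.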