psanchez 4 hours ago [-]
This reminds me of a story from 15 years ago, where I was developing a technology to download games on demand by hooking into the OS calls.
There was a particular game that was superslow when this tech was applied. Original game loading took around 15-20 seconds, whereas once the tech was applied it took easily 3-5 min, even with all data already downloaded.
When I started digging into it, I realized the reason was the game was using something like
fread(data, 1, 65536, fptr);
instead of
fread(data, 65536, 1, fptr);
Which basically expanded back in the day to 65k reads of 1 byte each for a several-MB file. Each fread translated to 65k calls to the ReadFile Windows API. Since my code was hooking the ReadFile system call, and my call was heavier than ReadFile, the game loading felt really slow. Unusable. It would not have been fun for players.
The easy fix was to swap arguments for certain calls. The long fix required using an internal cache to account for these cases so that the hooked ReadFile was faster when data was already on disk.
Funny thing is that as we started rolling out the tech and applying it to more and more games we realized lots of games did this. We went for the cache fix and games ended up loading faster than before. Honestly, games could have loaded all the data in a couple of seconds by just swapping the args. I'm guessing developers did this on purpose so that games seemed like they were loading a lot of stuff, although you never know.
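To make the "cache fix" idea concrete, here is a rough sketch in C. It is not the actual implementation from the story: backend_read is a hypothetical stand-in for the expensive hooked read, and every struct and function name here is made up for illustration. The point is only that single-byte requests can be served from one large read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { CACHE_SIZE = 65536 };

/* Hypothetical stand-in for the expensive hooked read: copies from an
 * in-memory "file" and counts how often it is called. */
struct backend {
    const unsigned char *data;
    size_t size;
    size_t calls;
};

size_t backend_read(struct backend *b, size_t offset,
                    unsigned char *out, size_t len)
{
    b->calls++;
    if (offset >= b->size)
        return 0;
    if (len > b->size - offset)
        len = b->size - offset;
    memcpy(out, b->data + offset, len);
    return len;
}

struct cached_reader {
    struct backend *backend;
    size_t pos;          /* current file offset of the reader */
    size_t cache_start;  /* file offset of cache[0] */
    size_t cache_len;    /* number of valid bytes in cache */
    unsigned char cache[CACHE_SIZE];
};

/* Reads one byte; on a cache miss, refills the cache with one large
 * backend read. Returns 1 on success, 0 at end of data. */
int cached_read_byte(struct cached_reader *r, unsigned char *out)
{
    if (r->pos < r->cache_start || r->pos >= r->cache_start + r->cache_len) {
        r->cache_start = r->pos;
        r->cache_len = backend_read(r->backend, r->pos, r->cache, CACHE_SIZE);
        if (r->cache_len == 0)
            return 0;
    }
    *out = r->cache[r->pos - r->cache_start];
    r->pos++;
    return 1;
}

/* Reads a zero-filled file of `size` bytes one byte at a time and
 * returns how many backend reads that took. */
size_t backend_calls_for(size_t size)
{
    static struct cached_reader reader;
    unsigned char *file = calloc(size ? size : 1, 1);
    assert(file != NULL);

    struct backend b = { file, size, 0 };
    memset(&reader, 0, sizeof reader);
    reader.backend = &b;

    unsigned char byte;
    while (cached_read_byte(&reader, &byte))
        ;
    free(file);
    return b.calls;
}
```

In this sketch, reading a 100000-byte file one byte at a time costs three backend reads (two refills plus the final end-of-data probe) instead of 100000.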
mort96 4 minutes ago [-]
Wait, is that wrong? I always call fread as:
fread(data, 1, sizeof(buffer), f);
with the rationale that I'm interested in reading sizeof(buffer) individual bytes. The buffer size is incidental, not the size of the items I'm trying to read from the file; "read one item whose size is sizeof(buffer)" seems semantically wrong.
Is this just the case of Windows having a bad stdlib fread implementation 15 years ago or is my thinking here actually wrong?
Taniwha 3 hours ago [-]
I used to be a graphics card/chip architect for macs in the early/mid 90s - our chips were the fastest, but some programs were resistant because they did stupid stuff: pagemaker invalidated the font cache every time it went thru its main loop, quark with ATM did an n*2 thing every time it wrote text etc etc. We had special hardware to accelerate text drawing and it did nothing because the software pissed it away. We considered creating a plugin that fixed all these things, it would have been hard to maintain, in the end we travelled around to the people who made these apps and talked them through their problems
To be fair excel would erase places white that it wanted to write up to 9 times before it drew any black pixels, we made that very fast! we didn't tell them :-)
At the time 24-bit framebuffers were so slow that before we built graphics acceleration hardware people would switch back to 8-bit to get stuff done, making 24-bit/true colour your daily driver was a big step forward.
urbandw311er 2 hours ago [-]
This is a horrible and yet not unexpected insight into the internals of Excel
Taniwha 2 hours ago [-]
To be fair this was Excel 25 years ago, may no longer be true.
One of the other bugs (the Quark/ATM one) was also because the programmers were worried about writing over stuff that hadn't been completely erased, the Quark guys wrote a string with 2 spaces at the end through a box that masked the end of the string, the ATM font renderer saw it couldn't fit the text so it split it in half and tried again so it drew N/2 N/4 N/8 ... strings. It spent all its time in the 68k's multiply instructions figuring out how wide the strings (and substrings) were, our fancy 24-bit character rendering hardware was an afterthought
trelbutate 49 minutes ago [-]
> To be fair excel would erase places white that it wanted to write up to 9 times before it drew any black pixels
I feel like I'm having a stroke trying to read this, what does it mean??
NSUserDefaults 37 minutes ago [-]
Several layers of white is what makes the black really pop. (Just kidding).
sixeyes 40 minutes ago [-]
before writing to some area, it would erase it (clearing with white) up to 9 times
b112 24 minutes ago [-]
It means they were time travellers! Secretly, they came from an alternate future where everyone used e-ink displays, and wanted Excel to be ready!
bathtub365 1 hours ago [-]
In all of the software you’ve written, are you aware of how many on-screen pixels you’ve overdrawn?
Someone 2 hours ago [-]
> Which basically expanded back in the day to 65k reads of 1 byte for several MB file. Each fread translated to 65k reads of ReadFile Windows API
What software did that that badly? If the code asks for (up to) 65,536 single byte items, why would you split that into 65,536 calls?
Also, that change changes behavior. The old call could read anything from zero to 65,536 bytes, the new one only can read zero or 65,536 bytes.
(Reading the source of a few implementations, I think most implementations will fill the output buffer with partial objects if the input doesn’t supply an integral number of them, but the return value of fread cannot signal that to the caller)
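A small C sketch of that return-value difference, for anyone who wants to try it. The helper name read_back and the 10-byte sample are illustrative assumptions, not code from the game or from any particular libc. It relies only on the C standard's rule that fread returns the number of complete items read.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes `len` bytes to a temporary file, reads them
 * back with fread(buf, size, nmemb, f) and returns what fread returned.
 * Returns (size_t)-1 if the temporary file cannot be set up. */
size_t read_back(const char *bytes, size_t len, size_t size, size_t nmemb)
{
    char buf[64];
    assert(size * nmemb <= sizeof buf);

    FILE *f = tmpfile();
    if (f == NULL)
        return (size_t)-1;
    if (fwrite(bytes, 1, len, f) != len) {
        fclose(f);
        return (size_t)-1;
    }
    rewind(f);

    /* The return value counts complete items, not bytes. */
    size_t got = fread(buf, size, nmemb, f);
    fclose(f);
    return got;
}
```

With a 10-byte file, read_back("abcdefghij", 10, 1, 16) returns 10, while read_back("abcdefghij", 10, 16, 1) returns 0: the short read is visible with size 1 and invisible with the arguments swapped.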
micampe 59 minutes ago [-]
A long time ago I worked with someone who read 1 byte at a time from a socket because they insisted data was cached so the kernel was going to batch it magically somehow. It took me days to convince them to measure it.
quietbritishjim 39 minutes ago [-]
That's different: you're talking about the application code, like OP.
But I think the parent comment's point is that the issue is in the implementation of fread itself in the standard library. It's perfectly reasonable for an application to pass it 1, 65536 (i.e. one byte, up to 65536 times) and expect it not to issue 65536 separate OS calls.
b112 18 minutes ago [-]
Is it? I get what you're saying, but asking for 1 byte 65536 times, is indeed different than asking for 65536 bytes, 1 time. There may be reasons, such as when you pull off the end of a buffer, it shifts. And the buffer size is 1 byte. Or 10. Or whatever.
No, I'm not saying that's why. I'm simply saying there is a difference between asking for 1 byte or 65k bytes of something. Even dd runs the same under Linux.
dd bs=10k count=1 is faster than bs=1 count=10k
I remember trying to recover some data from a spinning disk, and trying to slowly creep up on the data. So I wanted 1 byte per, I wanted it to nibble, until it hit whatever the errored part was. If I just grabbed the lot, it'd error out from the whole read.
somenameforme 2 hours ago [-]
Doesn't that break anything relying on the return value? fread gives you the number of objects read as a return. So I think a pretty typical thing would be to fread and then parse that number of characters, and that'd just break?
jcul 2 hours ago [-]
I've seen a lot of code that just assumes fread / fwrite succeeded without bothering to check the return value...
But in this case if the code was calling fread 65536 times in a loop and getting 64KiB each time it wouldn't be good either!
Sounds like the parent comment had to fix this with the internal cache thing to speed up the small freads. I think they meant the easy fix would have been swapping the args in the original / caller code.
account42 1 hours ago [-]
There are no small freads in the story, whatever implements those freads supposedly split them up into many calls. But that sounds more like a problem of that implementation than the fread callers as size == 1 is correct when you are reading a bag of bytes.
koolala 2 hours ago [-]
I think they turned it from a tiny file read to a tiny ram read.
DonHopkins 2 hours ago [-]
The type of programmer who swaps the args to fread tends to be the type of programmer who doesn't bother to check the return value, fortunately.
account42 1 hours ago [-]
But the args aren't necessarily swapped just because they end up in a slow case in some implementation.
lukan 2 hours ago [-]
"I'm guessing developers did this on purpose so that games seemed like they were loading a lot of stuff"
I really hope that was not the case, and would rather think it was incompetence or a way to deal with obscure legacy problems, but the gamer in me gets enraged at the thought someone would artificially increase loading times.
dlcarrier 4 hours ago [-]
SimCity had a read-after-free bug that Microsoft patched in Windows 95. That was a lot easier for customers than having Maxis fix it, which could have required exchanging copies of the game.
Cthulhu_ 3 hours ago [-]
It feels like graphics drivers do / did this a lot too. At the very least they make specific optimizations for specific games, probably by tweaking settings and features that the game developers didn't optimize properly themselves.
That's a case of the driver cheating but there are also lots of cases where the game is just full of bugs that the driver has to work around in order to not be blamed for them.
Gibbon1 47 minutes ago [-]
I've said over the years a few times, this isn't our fault but it's our problem.
SyzygyRhythm 1 hours ago [-]
There are many, many, cases like this, including correctness fixes. One recent example I remember had a shader that computed:
x = a / b * b
The optimizer was allowed, but not obligated, to transform that into:
x = a
However, in this case, b was sometimes 0. And if so, the unoptimized version computed:
x = a / 0 * 0 = Inf * 0 = NaN
So badness ensued if that particular path didn't get optimized, which could happen under various circumstances. We had to add some code to ensure that transformation always happened on that game.
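The arithmetic is easy to reproduce on a CPU. Below is a minimal C sketch, assuming IEEE 754 floats; the function name is made up and this is not the shader in question. The volatile temporary is only there so the compiler evaluates the expression literally instead of folding it.

```c
#include <assert.h>
#include <math.h>

/* Evaluates a / b * b left to right, the way the unoptimized path would. */
float divide_then_multiply(float a, float b)
{
    volatile float q = a / b;  /* volatile: keep the division as written */
    return q * b;
}
```

For b != 0 this gives back roughly a, but divide_then_multiply(1.0f, 0.0f) is Inf * 0, which is NaN, so the result depends on whether the simplification to x = a was applied.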
easyThrowaway 2 hours ago [-]
The most interesting part is that IIRC they shipped the entire Windows 3.11 memory allocator to make it work.
I have very little understanding on how allocation works at OS level, but I'm surprised there are no wrappers like dgVoodoo or dxWrapper specifically for this kind of issues. There are quite a bunch of old Windows games (Need for Speed 1-4 for a start) that refuse to run on modern OSes due to rather...bold memory management strategies.
rincebrain 2 hours ago [-]
Apparently the recollection of the fix was that they deferred actually freeing memory for a while if they detected it was SimCity running. [1]
A story I heard at Sun, which may be apocryphal but was fucking hilarious enough to be a repeatable rumor, was that a release of an early operating system in BETA was determined to be solid and tested and ready to release and ship to customers, so they simply changed the version string from something like "SunOS2.1BETA" to "SunOS2.1FCS" (First Customer Ship), and recompiled. But the change from a 12 character version to an 11 character version threw off the alignment of some important data structures somewhere in the kernel, and the entire OS ran MUCH SLOWER because of 68k unaligned memory accesses!
hodgehog11 5 hours ago [-]
I think we're starting to see more of this sort of thing happening now with Proton and Wine gaining prominence in the Linux community. Some games (Elden Ring comes to mind) have bad enough PC ports when they come out that the compatibility layer can incorporate a hotfix to improve performance, while users of the software on the original platform still had to suffer.
Gigachad 4 hours ago [-]
Fairly sure GPU drivers do the same thing where they include a ton of per game tweaks to make them run faster. It does feel like a fragile way of doing things where an external component that should be agnostic to the software running ends up including a handful of junk trying to fix stuff that should have been fixed by the consumer of the driver.
zoenolan 2 hours ago [-]
The big one I remember was many applications, not just games, assuming the buffer swap was performed by a blit into the display buffer, not a framebuffer pointer update. They relied on the previous frame's data still being in the back buffer. For those applications you were forced to blit the buffer, not swap the pointer and take a performance hit.
I also remember a media player being called out by name in the code for doing invalid operations, needing a work around and code to detect it was running just to function.
Guvante 3 hours ago [-]
It goes the other way too, sometimes you trigger some optimization silliness in the driver and the game needs to adapt to avoid it.
rickdeckard 3 hours ago [-]
then the driver gets updated and the game either continues to optimize (wrong) or branches out into code that was written before that driver came out and generally wasn't that well tested, and the circle continues...
It's the life of a (game) developer...
anilakar 4 hours ago [-]
GPU driver packages are already a huge collection of workarounds for bad game engine coding.
An Nvidia employee once told me that one of the easiest ways to squeeze out a few extra frames on your old machine is to rename the game executable to hl2.exe.
st_goliath 3 hours ago [-]
> GPU driver packages are already a huge collection of workarounds for bad game engine coding.
And of course, browser engines also do the same things for certain websites:
I can see how it can modify GPU driver behavior, but I cannot see how it would get you better performance with everything else the same?
What it should do is ensure some things not relevant to Half-Life 2 were not done, thus getting better performance for this game in particular, but there is no guarantee that same optimizations work for other applications or games, so one should not expect an overall improvement.
Unless they are doing some silly things like dropping quality, but that's the "everything else the same" point.
If not, why not have this enabled as default behavior instead?
dlcarrier 2 hours ago [-]
I wouldn't be surprised if it made other games on the Source engine faster, but everything else slower.
limflick 3 hours ago [-]
> to rename the game executable to hl2.exe
This seems genuinely unbelievable. Does anyone have a technical explanation for this?
hurtigioll 3 hours ago [-]
gpu drivers detect games, among other things by looking at executable names
then driver "optimizes" behavior, sometimes dishonestly (reducing precision), sometimes honestly (working around game engine stupidity)
limflick 3 hours ago [-]
Couldn't that also cause glitches since optimizations meant for HL2 might not work for, say San Andreas? I understand some optimizations might be universal but I can't help but think about unexpected behavior.
ChocolateGod 3 hours ago [-]
Yes.
A lot of people use Nvidia profile inspector to enable reBar on all games and claim that Nvidia is purposely holding back performance, but doing this causes many games to crash.
tester756 3 hours ago [-]
Whose problem is this?
Nvidia probably doesn't officially say anything about this and 99.9% of people do not rename the process
account42 59 minutes ago [-]
It's definitely Nvidia's problem if this breaks something. Nothing in the D3D/OpenGL specs says that you can (not) use certain executable names.
redsocksfan45 16 minutes ago [-]
[dead]
limflick 2 hours ago [-]
Phrasing, I wasn't blaming anyone, just curious about the technicalities.
hurtigioll 2 hours ago [-]
of course they do.
nvidia even has an official api for a game to identify itself so they dont need to look at executable name
proton_9 4 hours ago [-]
This sounds like a really interesting story, would like to read more on why half life 2 specifically? the game itself was pretty well optimized and ran on really low end hardware even back in the day.
db48x 3 hours ago [-]
Because everyone reported performance metrics using it as a benchmark. Higher number = more sales.
murderfs 3 hours ago [-]
If you go back 5 years, everyone was using Quake 3 Arena as the benchmark. ATI got in some hot water because if you renamed quake3.exe to quack3.exe, your FPS would drop by 15%, because they were silently reducing quality to juice their benchmark numbers.
jkrejcha 3 hours ago [-]
Apparently people did this with the DirectX "3D Tunnel" demo as well[1] back over 20 years ago.
Also there was one "that checked if you were printing a specific string used by a popular benchmark program. If so, then it only drew the string a quarter of the time and merely returned without doing anything the other three quarters of the time".
At least these actually make things faster usually.
kazinator 3 hours ago [-]
> Anyway, my colleague found that there was one program that needed to allocate around 64KB of memory on the stack and initialize it. The standard way of doing this is to perform a stack probe to ensure that 64KB of memory is available, then subtracting 65536 from the stack pointer, and then initializing the memory in a small, tight loop.
Actually, the standard way of allocating 64 kB of memory on the stack is to just assume you can do it, subtract 64k from the stack pointer, and hope for the best.
Most stack allocations in the wild are not checked.
i_don_t_know 1 hours ago [-]
IIRC you have to probe every page of the stack on Windows. You cannot just subtract a value from ESP/RSP. If you don't probe every page in order, you get a page fault or some other exception (I don't remember which one).
ashdnazg 1 hours ago [-]
I worked on a transpiler from Nand2tetris assembly to WebAssembly, and had some really annoying memory corruption bug that I just couldn't solve.
That is, until I checked the program I used for testing (which I didn't write), and found the following code:
dealloc(this)
return this->field
With the original allocator, this worked fine, since the deallocation didn't touch the memory.
My allocator, however, overwrote the field during the deallocation with bookkeeping stuff, which meant the returned value was not what the programmer intended and after a short while the program crashed.
Unlike TFA, I had the luxury of just fixing the test program.
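Roughly what that looks like in C, as a toy sketch. The one-slot allocator, FREE_MARKER and all function names are hypothetical, and the slot is static storage so that reading it after dealloc is well defined here, unlike a real use-after-free.

```c
#include <assert.h>
#include <string.h>

struct object { int field; };

/* Toy one-slot allocator: dealloc() reuses the first bytes of the
 * released block for bookkeeping. */
static struct object slot;
static const int FREE_MARKER = -1;  /* stand-in for allocator bookkeeping */

struct object *alloc_object(int value)
{
    slot.field = value;
    return &slot;
}

void dealloc(struct object *obj)
{
    memcpy(obj, &FREE_MARKER, sizeof FREE_MARKER);
}

/* Buggy order from the test program: release first, read afterwards. */
int release_then_read(struct object *obj)
{
    dealloc(obj);
    return obj->field;  /* now holds bookkeeping data, not the value */
}

/* Fixed order: copy the field out before releasing the object. */
int read_then_release(struct object *obj)
{
    int value = obj->field;
    dealloc(obj);
    return value;
}
```

With an allocator that leaves released memory alone, both orders return the same value, which is why the bug stayed hidden until the allocator changed.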
wazoox 1 hours ago [-]
IIRC, one of the similar old stories from Raymond Chen is about SimCity 2000, which did a similar trick (free memory, then start immediately using it) that worked just fine under DOS, but was a big no-no starting with Windows 95. The game was so common that Windows had to include a special rule to make it run...
classichasclass 5 hours ago [-]
Betting Alpha was the native architecture in question. It seemed to have the best support.
electroglyph 4 hours ago [-]
heh, when Raymond Chen dunks on the MSVC team =)
jeffbee 4 hours ago [-]
People from Transmeta told me stories about how their translators were full of special case optimizations to fix horrors they discovered in Microsoft Windows itself.
wolfi1 3 hours ago [-]
speaking of which, what became of it?
hbbio 1 hours ago [-]
Acquired by a patent monetization business...
notorandit 5 hours ago [-]
> they fixed it during emulation
It means the fix was applied to run during the emulation loop execution, not that the fix was found and applied while the emulation loop was running.
Which would have made it an emulation code escape.
m1r 5 hours ago [-]
Couldn't they just turn the optimization off for this loop?
MadnessASAP 5 hours ago [-]
They didn't have the code for the offensive program, they were creating the emulator to run it on a different architecture.
McGlockenshire 4 hours ago [-]
> offensive program
Agreed.
notorandit 5 hours ago [-]
Which optimizer replaces a 64k loop with 64k instructions?
Ah, yes. Microsoft's!
selcuka 4 hours ago [-]
There is no indication that the compiler that produced the code was Microsoft's. Actually the article hints otherwise ("[...] whatever compiler was used to compile this code").
ant6n 3 hours ago [-]
Arguably more of an optimization, rather than a fix. Looks like un-unrolling a loop, or better, rolling a loop. Or rolling straight line code?
senfiaj 38 minutes ago [-]
Yeah, but after a certain point the win is negligible. Huge code can also increase cache misses which will slow down things.
yieldcrv 4 hours ago [-]
> All in all, it took this program 256 kilobytes of code to initialize 64 kilobytes of data.
[1] - https://www.joelonsoftware.com/2000/05/24/strategy-letter-ii...
https://github.com/WebKit/WebKit/blob/main/Source/WebCore/pa...
https://github.com/WebKit/WebKit/blob/main/Source/WebCore/pa...
[1]: https://devblogs.microsoft.com/oldnewthing/20040305-00/?p=40...
Windows 95 patched a bug in SimCity just to get it to work.
I agree it would be stupid for a compiler to even support such a flag, but those were the 1980s/90s.
https://www.shlomifish.org/humour/by-others/funroll-loops/Ge...
solidity sweating profusely