Colleagues, hello! Today we will talk a little about how memory is organized in microcontrollers, how it is allocated for our needs, how to use it effectively, and how to avoid wasting it without noticing. The theoretical part will be somewhat generalized and simplified, without excessive detail, so as not to overcomplicate the video. The examples, however, will be quite detailed.
For simplicity, almost all of them will be for the Arduino platform and the AVR architecture, but we will also touch tangentially on both STM and ESP. We will focus not on the internal structure, but on the logic of operation and practical techniques. You will find all the links to the sources in the description. There is quite a lot of material, so let's get started right away! In general, the memory of a microcontroller is organized more or less as follows: usually we have a certain amount of Flash memory on board, containing our firmware, or a bootloader, or both. The Arduino Nano has 32 kilobytes of read-only,
non-volatile Flash memory. As you understand, it will not necessarily be completely filled; some part will remain free, and in some cases we can even use it as additional programmable ROM for our own needs. In AVR microcontrollers this is slightly more complicated, because in the ATmega series, write instructions to Flash memory can execute only from the boot section and are prohibited in the section where our program is located, while in STM there are no problems with this. By the way, the bootloader does not have to be at the end of the Flash memory; it depends on the architecture of the MCU. It is worth clarifying here that writing to ROM is performed in two steps:
first an erase, and then the actual write. And unlike EEPROM, Flash memory is erased page by page, i.e. you cannot erase less than one page, whose size depends on the microcontroller model. For example, in the ATtiny10 it is 16 bytes, in the ATmega328
it is 128 bytes, and in the STM32F0 series it can be 1 or 2 kilobytes. Usually, the larger the Flash, the larger a single page will be. Writing can likewise be page-by-page or one cell at a time; this information can be found in the documentation for the specific microcontroller. Next is the RAM. Depending on the architecture of the microcontroller, it can share an address space with Flash memory or live in a separate one... In general, memory addressing can be
quite complicated; for example, some parts of the RAM address space can be used to address processor registers, as well as configuration registers and internal blocks and peripherals. Depending on the MCU model, the built-in RAM can be supplemented with external RAM. But we won't go into such details now; we'll just assume that we have some RAM that we can use to run our program and store its variables and data. And one more type of memory that we'll touch on a little today is the Electrically Erasable Programmable ROM, or EEPROM. As the name implies, this type of memory also
requires erasing before writing, but it can be erased and written not by pages, as with Flash, but one cell at a time. That is what makes it convenient, but its density is lower: the same die area holds less EEPROM than Flash. And in microcontrollers the EEPROM is usually much smaller than the Flash. Well, actually, you know all that without me. Now for the most interesting part: the logical organization of it all! Let's skip the boring diagrams for now and turn straight to practice. Let's compile this simple code for the ARM architecture, specifically for the Cortex-M0. Yes, the core isn't new (it dates from 2009), but it's still very common.
Here we have two global variables, one of which is initialized with the value 2, and the other has no explicit initialization. Then we perform some arbitrary actions with these variables; it doesn't really matter what. And now the first lifehack: with the #pragma GCC optimize directive you can set the desired optimization level of the compiled code. All the code below the directive, to the end of the current file, will be compiled with the level you set. In addition, you can set the optimization level for each function separately using the optimize attribute. Both can sometimes be very convenient: you can, for example, optimize some of your functions for execution speed and everything else for code size. It's quite a workable option. Here is a list of the optimization options that
are used most often. In fact, each of these options simply activates a specific set of lower-level optimization options; there are quite a lot of them, and you can study them yourself if needed. In the pragma directive and in the attribute, a dash is usually placed before the "O", but it seems to work without it too. In this example I disabled optimization, because otherwise the compiler would cut out the code fragments it considers meaningless, and since no reasonable actions are performed here, almost everything would be removed. We don't want that.
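A minimal sketch of both techniques (plain GCC C; the function and numbers here are made up for illustration):

```c
/* From this pragma to the end of the current file, code is compiled at -O0. */
#pragma GCC optimize ("O0")

/* A single function can override the file-wide level with an attribute:
   here, size optimization (equivalent to -Os) for this function only. */
__attribute__((optimize("Os")))
int scale(int x)
{
    return x * 3;
}
```

Both forms are GCC-specific, so on another compiler you would look for its own equivalent.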
OK, let's compile an object file and run the size utility to understand how much memory of each type we used. You've probably seen its output if you've compiled a project in almost any development environment. A similar utility, avr-size, is also called for AVR in Arduino. It can be run with different options; let's look at the results with the -B option. We have three main sections here: .text, .data and .bss.
In fact, each of these sections may consist of several more specialized subsections; we will look at some of them today and leave some behind the scenes. But for now, let's assume there are only three main sections. The .text section is our compiled code, and it also includes the constants that are not used to initialize variables but are used directly from the firmware. We'll see how that works later. The .data section is a bit more complex: it also holds constants from the firmware, but the constants that
initialize global or static variables. If we look at the code once again, we'll see that there are only two global variables. Since this is a 32-bit architecture, the 'int' type occupies 32 bits (this is not always the case, and we will come back to it later). Of the two global variables, only one is initialized: var2, whose initial value is 2. This 2, stored as a 32-bit value, is the entire content of our .data section. So look: the .text section will end up only in the Flash memory of the microcontroller, while the .data section
will be in both Flash and RAM, because Flash will contain the initial value of the global variable (in our case, 2), while the variable itself, i.e. var2, will live in RAM. The .data section occupies both Flash memory and RAM. Sometimes, however, the utility that displays the size of object or executable files sums the sizes of the .text and .data sections and shows them as one number; that depends on the specific implementation and the selected display option. In our case they aren't summed, and as a result our firmware will occupy not 40 but 44 bytes of Flash memory, and the initialized variable will occupy 4 bytes of RAM.
Besides the .data section, there is the .bss section, which contains global and static variables that are not explicitly initialized. This section can also consist of subsections, but here it holds just one global variable, var1.
By default, all global or static variables for which no nonzero initial value is specified will be set to zero before the program starts. Even if we explicitly assign a zero value, such a variable will still go into the .bss section and will be zeroed before startup along with all the others. A special piece of code is responsible for initializing the variables in the .data and .bss sections; it is, of course, also located in the .text section and runs before the main program starts. Next, the utility outputs the sum of the sizes of all
sections in decimal and hexadecimal. These numbers are pretty useless in our case, because there is no practical sense in adding the amount of firmware located in Flash memory to the amount of global variables located in RAM. Now let's add two more string variables to our code, one of which we'll make a constant. We do something with them here, it doesn't matter what;
we're only interested in the sizes of our sections for now. Each of them takes up 16 bytes. I did this on purpose, because depending on the architecture, the compiler can align sections by adding empty unused bytes to variables so that the size of arrays (or fields in structures) is a multiple of the bus width; in our case, 32 bits or 4 bytes. There is nothing to align here, so the result will be as clear as possible. By the way, regular viewers of my channel already know that you can avoid alignment in structures using the packed attribute. This is usually not required, and moreover, compiled code that accesses unaligned fields of a structure is less efficient. But sometimes this attribute is necessary, for example, if you use such structures to organize data exchange with some other device... one more lifehack.
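A quick sketch of the effect (host-compilable C; the field names are invented):

```c
#include <stdint.h>

/* With default alignment the compiler inserts 3 padding bytes after
   'flag' so that 'value' starts on a 4-byte boundary. */
struct padded {
    uint8_t  flag;
    uint32_t value;
};

/* The packed attribute removes the padding. Useful when the struct
   mirrors a wire or file format, at the cost of slower unaligned
   accesses on some architectures. */
struct __attribute__((packed)) unpadded {
    uint8_t  flag;
    uint32_t value;
};
```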
So, we added one string variable and one string constant... compiling... and we see that the size of the .text section has increased, which is understandable. The .data section has grown by 16 bytes due to the var3 variable, which is not a constant and is assigned an initial value. And the .bss section remains 4 bytes, as before. So, if we declare a constant, then on ARM/Cortex its value will be located only in the .text section, and our program will access Flash memory directly
during operation. But if we need a variable, we don't write the 'const' keyword, and then a place in RAM is allocated for it and initialized before startup with a value stored in Flash. That is, just as with var2, our string variable goes into the .data section, which takes up space in both ROM and RAM. Now let's add a function to our code
in which we declare two local variables. Compiling... and we see that only the .text section has grown, while .data and .bss remain as they were before the new function and its variables. This happens because local variables declared inside procedures and functions are placed either in processor registers (if available) or in a special memory area called the Stack. Unlike the .data and .bss sections, the Stack is supported in hardware and is dedicated to storing data used while the program runs. These can be local variables; or, if there are not enough registers for the current task, registers can be freed by temporarily pushing their values onto the Stack and popping them back later. In addition, the Stack is used during subroutine
calls and during interrupt handling. As a rule, the Stack is located at the very top of RAM, and the processor has a special register that points to the current top of the Stack. The Stack grows in the opposite direction, i.e. from higher memory addresses to lower ones. When we put a 4-byte variable on the Stack, 4 is subtracted from the stack pointer; when we take the value back off the Stack, 4 is added to the pointer. It should be clarified that in most cases the Stack is located in the internal memory of the microcontroller. Since it is very actively
used, it is highly desirable that access to the Stack be as fast as possible. However, after initialization, some architectures allow moving the Stack to external RAM. The processor implements special instructions for Stack operations.
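The push/pop arithmetic just described can be modeled in a few lines of plain C (the RAM size and values are arbitrary; real hardware does this with dedicated instructions and the SP register):

```c
#include <stdint.h>

/* A tiny model of a full-descending stack: pushing subtracts from the
   stack pointer before storing, popping loads and then adds back. */
enum { STACK_WORDS = 16 };
static uint32_t ram[STACK_WORDS];
static uint32_t *sp = &ram[STACK_WORDS]; /* SP starts at the top of RAM */

static void push(uint32_t v) { *--sp = v; }    /* SP -= 4, then store  */
static uint32_t pop(void)    { return *sp++; } /* load, then SP += 4   */
```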
We will not dwell on the Stack any longer; there are quite a lot of details, for example those related to multitasking, when each task gets a separate memory area for its own stack. Let's just assume that at the very end of RAM we have a Stack growing towards the lower memory addresses. And note: here we have to initialize the variables ourselves. If we read a variable before assigning it a value, we get whatever happened to be in the memory cells allocated for it on the Stack, and that can be anything. That is, when allocating memory, the required amount is simply subtracted from the stack pointer, and our local variables are then located inside the allocated area at fixed offsets relative to the new top of the stack.
No additional work is done to clean this area. At the same time, if we use local constants... for example, here I have an array of numbers that may well be a constant; in terms of resulting code size, it may be advantageous to make such a constant global, so that instead of messing with the Stack, the procedure simply refers to the value stored in the .text section. Alternatively, you can make the constant static but declare it inside the procedure. If we declare a static variable or constant
inside a procedure, then a variable will be located not on the Stack or in a processor register but in the global variable sections, i.e. either .data or .bss, while a constant will be placed in the .text section. However, they will only be visible inside the function or procedure where we declared them. For example, if the var5 variable is made static, it will be placed in the .bss section, zeroed before the main program starts, and will retain its value between calls to the function. Before going further (and the most interesting things are still ahead), let's briefly summarize what we have found out so far.
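The static-local behavior just described can be sketched like this (plain C; the function name is invented):

```c
/* A static local is not placed on the Stack: it lands in .bss, is zeroed
   once before main(), and keeps its value between calls, while remaining
   visible only inside this function. */
static int call_count(void)
{
    static int count; /* in .bss, starts at 0 */
    return ++count;
}
```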
So, we have ROM, RAM, and sometimes also an EEPROM: a type of programmable ROM that allows more flexible writes but has a smaller capacity than Flash. The Flash memory contains the .text section, holding the initialization code, our firmware code, and constants; and also the .data section, which also contains constants, but only those used to initialize global or static variables in RAM. In RAM, the .data section is usually located at the very beginning. Next comes the .bss section, containing the global variables initialized with zeros. That is, before our firmware is launched, the initialization code runs first:
it simply copies the .data section from ROM to RAM and fills the .bss section with zeros. But at the very beginning, this code also initializes the microcontroller itself; in particular, it sets the stack pointer register to the top of the Stack, which is usually located at the end of RAM and grows down towards the .bss section. By the way, if we use RAM inefficiently or have a bug in our firmware, the Stack can grow so much that it creeps into the .bss section, with unpredictable results, most often a hang or a reset of the microcontroller. In some architectures such a collision is detected in hardware and raises an exception, but that is not always the case.
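As a host-side sketch, the copy-and-zero step looks like this (plain C; the arrays here are invented stand-ins for the symbols a real linker script would provide):

```c
#include <stdint.h>
#include <string.h>

enum { DATA_SIZE = 4, BSS_SIZE = 4 };

/* The load image of .data as it sits in Flash: var2's initial value 2. */
static const uint8_t flash_data_image[DATA_SIZE] = { 2, 0, 0, 0 };

static uint8_t ram_data[DATA_SIZE]; /* .data as it will live in RAM */
static uint8_t ram_bss[BSS_SIZE];   /* .bss */

static void startup_init(void)
{
    memcpy(ram_data, flash_data_image, DATA_SIZE); /* copy .data image */
    memset(ram_bss, 0, BSS_SIZE);                  /* zero-fill .bss   */
}
```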
The .text, .data, and .bss sections are LOGICAL structures; they are not fixed at the hardware level in any way, unlike, for example, the Stack, for which there is a dedicated processor register pointing to its top. Working with memory can involve other hardware mechanisms,
such as access control or collision and overflow detection, but in the minimal variant, only Stack operation is implemented in hardware. Now let's compile the same program for our beloved Arduino Nano and see how AVR differs from ARM from the programmer's point of view. Clearly, due to the different AVR instruction set and the Arduino core code, the .text section will differ in size.
And since AVR, unlike Cortex, is an 8-bit architecture, 'int' variables will also differ: they take up not 4 bytes but only 2. We will take this into account. Compiling... And what do we see? 46 bytes of RAM are used! In theory, if 'int' on AVR takes up less space, we should see lower RAM consumption, but something went wrong... Let's see what we've got in the sections. To do this,
export the compiled files and run the avr-size utility with the -B option. The .bss section predictably takes up 2 bytes: this is the global variable var1, which we use without explicit initialization, so it will be set to 0 before the program starts. But for some reason the .data section has grown from 20 bytes to 44. What happened? The fact is that the AVR has a Harvard architecture, unlike the Cortex-M0, which is based on the von Neumann architecture (also known as the Princeton architecture). In the von Neumann architecture, both instructions (i.e. our program) and data (i.e. variables) live in a single address space and share a common data bus and a common address bus. Therefore, in the case of the Cortex-M0 core, there is
no fundamental difference between instructions and data as far as the processor is concerned: it can easily access data located in our .text section, i.e. in Flash memory, and conversely, it can execute a program located in RAM. There are no problems with this. By the way, if you remember, the Z80, on which the good old Speccy is based, has the same architecture. The Harvard architecture, however, uses separate buses for code and data.
That is, the program executes from one address space, tied to Flash memory and intended for storing processor instructions, but the instructions themselves work with data located in another address space: RAM. You can no longer run a program from RAM, nor access data in ROM with ordinary instructions. Of course, there are special instructions for reading data from Flash memory (otherwise how would we initialize variables?), but these are exactly that, special instructions; all the usual work is done only with data in RAM. And if we still want to fetch and process data from ROM, it takes more code and runs slower. That is why, by default, in the Harvard architecture even constants are placed in the .data section, i.e. RAM is allocated for them
and their values are initialized before our program runs. At first glance it is not entirely clear why all these complications are needed, but such an architecture has a serious advantage: separate buses make it possible to parallelize access to ROM and RAM. And if everything is done by the rules, instructions in ROM and data in RAM, then the Harvard architecture is more productive than the von Neumann architecture. Now that we know all this, let's take another look at the program: we have the initialized variable var2, which is two bytes; an array variable var3 of 16 bytes; a constant array var4 of another 16 bytes; and a static constant array of 10 bytes declared locally in the procedure. Since we are dealing with the Harvard architecture, RAM will also
be used to work with the constants. In total, 2 + 16 + 16 + 10 gives us 44 bytes in the .data section. That's not the most efficient use of RAM, especially since there isn't much of it here. To solve this problem, the PROGMEM macro is used, which hides the attribute of the same name, __progmem__. By applying this attribute, we can place our
constant not in the .data section but in the .progmem section, which is part of the .text section and therefore lives only in Flash memory. In principle, exactly the same effect occurs if we explicitly specify the .progmem section to the compiler via the section attribute. The __progmem__ attribute, compared to the section attribute, seems to imply some additional compile-time checks, but there is another small difference, which we'll get to later. If you don't mind, I will not
talk now about using PROGMEM and the other useful macros for storing and accessing constants in Flash memory; all of that has been covered many times, and there is no point in repeating it. The only thing worth clarifying separately is that you shouldn't worry about small scalar constants: they are hardwired into the processor instruction code, so they end up only in the .text section and, unlike arrays,
need no extra handling. But you know that without me, so let's turn instead to a couple of non-obvious points about PROGMEM. The first one: if you have an AVR chip with more than 64 kilobytes of Flash memory, and your firmware together with constants and code takes up more than 64 kilobytes (otherwise, why would you buy such a chip), then part of the firmware will end up beyond the 16-bit address space limit, i.e. beyond those 64 kilobytes. If that part is data from the .progmem section, then to access it you will have to use not pgm_read_byte (word/float/etc...) but pgm_read_byte_far, a macro that can reach data in Flash located beyond 64K. And that, as you might guess, is slower than accessing a 16-bit address.
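For reference, a minimal sketch of the basic PROGMEM pattern mentioned above (this compiles only with avr-gcc, so take it as an illustration; the table contents are invented):

```c
#include <avr/pgmspace.h>
#include <stdint.h>

/* Without PROGMEM this 16-byte table would be copied into .data and
   occupy RAM; with it, the table stays in .progmem inside .text. */
const uint8_t wave_table[16] PROGMEM = {
    0, 25, 50, 74, 98, 120, 142, 162,
    180, 197, 212, 225, 236, 244, 250, 254
};

uint8_t read_sample(uint8_t i)
{
    /* Ordinary loads can't reach Flash on classic AVR, so a dedicated
       macro (an LPM instruction under the hood) is used instead. */
    return pgm_read_byte(&wave_table[i]);
}
```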
But there is another problem with the .progmem section growing beyond 64 kilobytes. By default, this section is placed in front of the code section, and you can't just move it somewhere else. Which, in general, is quite reasonable: if you have a lot of data, you probably need it and will access it often, so it is better to keep it in the first 64 kilobytes to ensure faster access.
However, if you are working in the Arduino environment, the problem is that the same .progmem section contains some important tables which, for example, map an Arduino pin number to its port and to the bit number within that port. And the most annoying thing is that these tables are placed in the .progmem section AFTER the user constants. Therefore,
if you put a lot of data in .progmem, these tables can get pushed beyond 64 kilobytes, while access to them is only provided via 16-bit addresses. Your program itself can work from anywhere in ROM, whether in the first 64 kilobytes or the last; yes, there are some kludges associated with calls to far functions, and such calls take a little longer to complete, but in general there shouldn't be any special problems. Apart from the fact that, with those tables unreachable, everything will break.
This problem has been known for a long time, but it still hasn't been fixed; I've checked. And there are a couple of lifehacks to work around it. First, remember we said a couple of minutes ago that instead of the PROGMEM macro you can use the section attribute directly? By the way, we can also wrap it in a macro. And if we do this and mark our large constants with the new macro,
then the compiler (actually the linker, but we won't go into details) will put these constants AFTER the Arduino tables. The program will still sit at far ROM addresses, behind the .progmem section, but everything will work. On the other hand, if your data still goes beyond those magical 64K, you will still have to access it using pgm_read_byte_far or one of the functions with the _PF suffix. And if so, why not put large amounts of user data at the very end, AFTER the program? To do this, using the same section attribute, we can specify one of the .fini sections instead of .progmem. There are 10 of them, and .fini2 is usually chosen, since it is left for the user and not occupied by anything; but you can take any other free one. Here we also see the .init sections, which we talked about:
they contain the code for initializing the microcontroller, the global variables, the .bss section, and C++ objects. The .fini sections serve to terminate the main program correctly, which in practice almost never happens: there is usually some kind of endless loop instead. And most importantly, they are also integral parts of the .text section and are located at the
very end of it. As a result, using these two lifehacks, you can place the most latency-sensitive constants in the .progmem section, with a guarantee that access to the tables we talked about will not be disrupted, while all your heavy arrays are moved to the far memory area. Yes, access to them will be slower, but again: if you have, say, some icons for menus and they don't fit into 64K anyway, then one way or another you will be accessing whatever didn't fit using heavier, slower instructions, and then it makes sense to put all the graphics at the end of the firmware. You can create another macro, name it PROGMEM_FAR, and use it. The description will contain a link to another example
in which these lifehacks are used, and you can inspect the result with such a command. There will also be a script that runs it; just don't forget to change the paths to your own. All this, of course, applies only to chips with more than 64K of Flash, and it is clear that all this voodoo stems from a rather kludgy architecture that was never designed for such sizes. Nevertheless, the architecture is still common and still popular, so I think it won't hurt us to have some clue about how it works. The situation is similar with the old ESP8266, where the latest tricks with attributes and the .fini section will not
work, since there is no problem with 16-bit addresses there, but the PROGMEM macro is still relevant. One more thing: many modern microcontrollers have a Harvard architecture, i.e. they enjoy the advantage of parallel access to instructions and data, but they are equipped with additional features that partially or completely remove the drawbacks of the awkward exchange between the ROM and RAM address spaces. You may be surprised, but the avrtiny architecture (used, for example, in the ATtiny10) and avrxmega3 (megaTinyCore) are able to map Flash memory into a specific area of the data address space. This significantly speeds up access to constants and eliminates the need for PROGMEM.
In ATtinys based on these cores, PROGMEM is not needed. So if you have ever done projects on an ATtiny not of the classic avr25 architecture but on the more recent cores, and used PROGMEM there, it makes sense to revisit those projects. As for more advanced chips, such as the ARM Cortex-M4/M7, which are also based on the Harvard architecture: they allow both access to data in ROM and execution of programs from RAM; all this beauty is handled by a bus matrix and presented as a single address space.
All the advantages of the Harvard and Princeton architectures are at your service. Friends, the video is already quite long, and I still have quite a few interesting tricks for you, and I really hope you will learn something useful. So if you write in the comments what exactly you found interesting, or maybe that you found nothing interesting, it will help me understand the audience better and choose more appropriate topics for the next videos. Thank you in advance! Let's move on! We have delved a little into the topic of sections; now we can roughly imagine how they are formed from the source code and how they then look in the memory of the microcontroller. Let's recall
the .bss section again. It contains global and static variables that were not explicitly initialized in the code or were initialized with zero values. Before our program launches, the entire .bss section is filled with zeros. But what happens if we arrange for a global variable not to be zeroed at all? Well, on the very first power-up it will most likely contain garbage, but if we assign some value to it, then after pressing reset that value will survive, because reset does not power the microcontroller off, it only restarts it. And considering that a reset can come not only from the button but also, say, from the watchdog timer, we can in principle use such an uninitialized variable to quickly restore operation after a reset or a failure. For example, you can save the parameters of the current
operating mode to the EEPROM every 10 minutes, in case of a complete shutdown, and use uninitialized variables to recover after a reset. In my example here, I simply count the number of resets since power-on. All this is done by assigning the variable to a special .noinit section. Since we've mentioned the EEPROM, one more thing is worth noting: if you have used EEPROM in the Arduino environment, you probably did it through the standard libraries. In that case, the variables you want to place in the EEPROM can be marked with the EEMEM macro, which hides the same section-assignment attribute; the section in this case is called .eeprom.
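A minimal sketch combining both techniques (avr-gcc only; the variable names and the initial value are invented for illustration):

```c
#include <avr/eeprom.h>
#include <stdint.h>

/* Survives a reset (but not a power cycle): the startup code leaves the
   .noinit section untouched, so whatever was in RAM is still there. */
__attribute__((section(".noinit"))) uint8_t reset_count;

/* Placed in the .eeprom section; the initial value 3 goes into the .eep
   file, not into the firmware image itself. */
uint8_t EEMEM saved_mode = 3;

void on_boot(void)
{
    reset_count++; /* contains garbage on the very first power-up! */
    uint8_t mode = eeprom_read_byte(&saved_mode);
    (void)mode;    /* restore the operating mode here */
}
```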
Basically, you can simply access fixed offsets inside the EEPROM and not declare any variables at all. But the trick is that you can set initial values for such variables, and the compiler will combine all these values into one binary file with the .eep extension. And then, as always, the problems begin. The Arduino environment uses the avrdude utility for flashing AVR microcontrollers.
And theoretically, it would be possible to burn the initial EEPROM values into the microcontroller along with the main firmware. But the first problem is that not every bootloader allows this. Arduino currently ships a fairly old version of the optiboot bootloader; it is bulky and does not support programming the EEPROM. Meanwhile, the bootloader designated as "Old Bootloader" in the processor selection menu does allow burning the EEPROM via the COM port. It really works, but unfortunately it's quite clumsy, and with it, for example, I was unable to get the previous lifehack with the .noinit section to work. There was no time to investigate in detail what was going on, but, as they say, some residual pain remained. The more recent versions of optiboot
are very good in terms of size and speed, but they also do not allow EEPROM burning. So the easiest and most reliable way is to do it with a programmer and the same avrdude utility. That is, instead of initializing the EEPROM from the program, spending code and constants on it (which is doubly annoying: using the constants of one ROM to initialize another ROM), you can do just that. And in order not to hunt for the right options every time, not to type out paths, and not to bother with the command line at all, you can use the excellent GUI shell for avrdude called AVRDUDESS. I highly recommend it; if you didn't know about it yet, this program will make your work with AVR much easier and more convenient.
Export the compiled binaries from the Arduino IDE, then open the EEPROM dump binary in AVRDUDESS and burn it. The fuses and the main firmware can be burned the same way. Everything is very convenient; the link will be in the description. By the way, I forgot to say that the Arduino environment does not even try to burn the EEPROM, even if you are connected via a programmer. Well, you could probably hack around it with some scripts... But f@ck it.
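If you do end up on the command line anyway, the avrdude call looks roughly like this (the programmer type, port, MCU and file name are placeholders to adjust for your setup):

```shell
# Write the exported EEPROM image; ":i" marks Intel HEX, use ":r" for raw.
avrdude -c usbasp -p m328p -U eeprom:w:sketch.eep:i
```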
Next. Since we are talking about the bootloader, I highly recommend installing and burning a fresh optiboot. It takes up only 512 bytes, and the rest of the Flash will be free for your firmware! Isn't that a miracle?! But something else is much more interesting: we have already said that Flash reprogramming instructions can only execute from the area of Flash intended for the bootloader. It's clear what this is for: so that the bootloader can program the microcontroller, but the user can't, because otherwise he would mess up, damage the bootloader, and so on. But we are responsible coders! We won't mess up, and if we do, not by much. So we want to be able to write to Flash memory ourselves! And optiboot provides such a possibility: we can call its internal functions, and since their code is located in the bootloader's memory area, the restriction on writing to Flash does not apply to them! So if your firmware takes up, say, only half of the available Flash, why not place a storage area for some values in the other half, like an EEPROM, only bigger. The problem here is that,
as we have already said, Flash memory is erased, and sometimes written, not byte by byte or even word by word, but in pages. That is, you have to write an entire page at once, which for the ATmega328P is 128 bytes. But this is not such a terrible limitation. In the example that I will attach to the video, the contents of a constant in Flash memory are updated after each reset, so that in the string "Hello world" each
vowel is replaced by a random vowel of the English alphabet. Once upon a time, back in my school years, I wrote a simple DOS .com virus for laughs that reprogrammed the video adapter in a similar way, and in Norton it looked quite funny. The next topic is dynamic memory allocation. Let's recall our memory map once more and which sections live where. Here we have one more section, .noinit, which, although it is a
separate section, is quite often considered as a part of the .bss section. We have a Stack growing downwards and located in the upper addresses. And there is another area located above the .bss and .noinit sections, intended for dynamic memory allocation and called the Heap.
The Heap, as the name implies, grows upward. That is, the Stack and the Heap grow toward each other, and if this growth is not controlled, they will meet, the same memory area will end up storing different data at the same time, and... well, you get the idea. That is why it is usually recommended not to use dynamic memory allocation on microcontrollers at all. Unfortunately, this is not always possible, because there are cases when we do not know in advance how many objects we need memory for, so we cannot allocate it statically. For example, we may not know how many temperature sensors or other devices sit on the OneWire or I2C line: however many the user connects is how many there will be. For such a situation there are two options. Either we pre-allocate memory for some maximum number of devices that our microcontroller is prepared to handle. This makes memory usage easier to control, but it is extremely inefficient: we might allocate memory for, say, 8 sensors while the user connects only one, and 7 memory slots sit idle. Not good. To avoid that, we can dynamically allocate memory for objects as they are discovered. Then we take up exactly as much memory as we need and can use the rest for other tasks. It sounds nice, but, as usual, there is a nuance, called memory fragmentation. The memory allocation algorithm is actually quite interesting, so let's take a closer look at it. For dynamic allocation we have the standard 'malloc' function, and if we just allocated blocks one after another and did not track the release of previously occupied ones at all, we would run out of memory instantly. So we need some mechanism for reusing memory once it is freed. We cannot know in advance which blocks will be freed in what order, and worse, we do not know where in our firmware the pointers to the still-occupied blocks are stored.
When some blocks are freed, we cannot simply move the occupied blocks into the empty spaces, because that would invalidate the pointers to them. (If we had full control over all our libraries and were sure that none of them allocates memory dynamically, we could in principle pull off such a trick, though still with certain restrictions.) So we have to work with what we have, i.e. try to place new blocks where old ones were released. For that we need some kind of table listing all the available blocks, their sizes and locations. The size of this table is also not known in advance, yet it has to be stored somewhere. The simplest, and therefore ingenious, solution is to store the list of free blocks in the free blocks themselves. They're free anyway! With this scheme, each empty block contains its own size and a pointer to the next free block, and so on until the last empty block, which points to nothing. The address of the first free block is stored separately, in a special pointer. With this approach, a block cannot be smaller than the size field plus the pointer field; on AVR that is 4 bytes in total. And keep in mind that the size field is present in occupied blocks too. If we call 'malloc' with the parameter '1', i.e. we want 1 byte allocated on the Heap, then in fact 4 bytes are taken on the Heap: two for the block size field, one for our needs, and one unused. You can already guess how the 'free' procedure works: it finds the position for the block being freed in the list of free blocks and inserts it there. Then, on the next 'malloc' call, the entire list of free blocks is traversed first, and if a block is found that exactly matches the required size, the algorithm unlinks it from the list and returns a pointer to it. If there is no exact match, the function selects the block that differs least from the required size, and if the difference is 4 bytes or more (i.e. another free block still fits in there), the block is split in two; to minimize the number of steps and speed things up, the first part of the block stays free and the part closer to the end becomes occupied. If the block cannot be split, the new allocation takes the whole block, and 1 to 3 bytes simply go unused.
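To make the mechanics concrete, here is a tiny host-side model of such a free list (a sketch of the idea only, not the actual avr-libc code: `toy_malloc`, `toy_free` and the 2-byte offsets standing in for AVR pointers are all my own invention). It keeps a 2-byte size field in front of every block, links free blocks through their own payloads, prefers an exact fit, and otherwise splits the best fit so that the handed-out part sits at the end of the free block:

```cpp
#include <cstdint>
#include <cstring>

// Toy model of an avr-libc-style free list. Offsets into `heap` play the
// role of pointers (2 bytes, as on AVR); NIL means "no block".
constexpr uint16_t NIL = 0xFFFF;
constexpr uint16_t HEAP_SIZE = 256;
static uint8_t  heap[HEAP_SIZE];
static uint16_t brk_top   = 0;    // first unused byte at the top of the heap
static uint16_t free_head = NIL;  // head of the free-block list

static uint16_t rd16(uint16_t off) { uint16_t v; memcpy(&v, heap + off, 2); return v; }
static void     wr16(uint16_t off, uint16_t v) { memcpy(heap + off, &v, 2); }

// Block layout: [size:2][payload]. A free block's payload starts with the
// offset of the next free block. Returns the payload offset, or NIL.
uint16_t toy_malloc(uint16_t len) {
    if (len < 2) len = 2;                       // payload must fit the list link
    uint16_t prev = NIL, best = NIL, best_prev = NIL;
    for (uint16_t cur = free_head; cur != NIL; prev = cur, cur = rd16(cur + 2)) {
        uint16_t sz = rd16(cur);
        if (sz == len) {                        // exact fit: unlink and return
            if (prev == NIL) free_head = rd16(cur + 2);
            else             wr16(prev + 2, rd16(cur + 2));
            return cur + 2;
        }
        if (sz > len && (best == NIL || sz < rd16(best))) { best = cur; best_prev = prev; }
    }
    if (best != NIL) {
        uint16_t sz = rd16(best);
        if (sz - len >= 4) {                    // split: the head part stays free,
            wr16(best, sz - len - 2);           // the tail part is handed out
            uint16_t tail = best + 2 + rd16(best);
            wr16(tail, len);
            return tail + 2;
        }                                       // too small to split: hand out the
        if (best_prev == NIL) free_head = rd16(best + 2);   // whole block,
        else                  wr16(best_prev + 2, rd16(best + 2));
        return best + 2;                        // 1..3 bytes simply go unused
    }
    if (brk_top + 2 + len > HEAP_SIZE) return NIL;
    uint16_t blk = brk_top;                     // nothing reusable: grow upward
    wr16(blk, len);
    brk_top += 2 + len;
    return blk + 2;
}

void toy_free(uint16_t payload) {               // simplified: no coalescing
    wr16(payload, free_head);                   // link lives inside the payload
    free_head = payload - 2;
}
```

For instance, `toy_malloc(10)` followed by `toy_free()` and another `toy_malloc(10)` hands back the very same block (exact fit), while asking for 6 bytes instead lands in the tail of the freed 10-byte block and leaves its head on the free list.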
Why am I telling you all this in such detail? Look at this example: first we allocate memory for a 10-byte object. Then a loop begins, in which we immediately free that object and allocate two more blocks, a small one of 6 bytes and another of 10, and the loop repeats. That is, on every iteration we have a row of small objects in memory followed by one large one; then the large one is freed, a small one takes its place, and memory for a large one is allocated again. Large objects don't stay long, while small ones accumulate. We have a certain amount of free memory, and our task in this example is to estimate how much of it we can actually use, and how long the operations take. Here I estimate the remaining free memory as the difference between the top of the Stack and the top of the Heap, and I stop the loop when this difference drops below 512 bytes, so that the Stack is guaranteed room to operate normally. Launching it... and we get that out of the available 1332 bytes, we managed to use 1000 bytes, or 166 objects, for our purposes. Let's check: that should be 165 objects of 8 bytes each (remember the two extra bytes for the block size field) plus 1 object of 12 bytes: 165 * 8 + 12 = 1332 bytes. Correct. The loop took 4 milliseconds to run. Now, in the same program, let's move the allocation line for the large object just two lines up inside the loop. From the algorithm's point of view little has changed: as before, large objects do not live in memory for long, although now there are moments when two of them are in memory at once, and small ones gradually accumulate. Launching... and we see a completely different result: out of the same 1332 bytes we managed to take only 670 bytes for 111 objects, and it took 8 times longer: 32 milliseconds! Let's figure out why. At the start we allocate space for a large 10-byte object (12 bytes are actually taken, as we remember, but that's not important now). Then we place another one of the same size and immediately free the first. After that we ask for a small 6-byte object. There is space: a whole 10 bytes are available. So this new 6-byte block is placed at the end of that free block, and its first part remains free. Then the loop repeats. Now a large object can only go at the end of the Heap. Then we delete the previous one and try to squeeze in a small one: it no longer fits into the first free slot, but it fits into the second, and is again placed at the end of the free block. And so on. As a result we get unoccupied scraps of memory that are never used and simply waste precious RAM. Obviously I constructed this example deliberately for clarity, but it shows well how easy it is, by moving a single line of code without changing the essence of the algorithm, to cut the amount of usefully occupied memory by a third and slow everything down as well, because every call to malloc walks the entire list of free chunks, and there are a lot of them here.
Of course, different architectures have their own specifics; for example, the minimum allocation may be larger and also necessarily a multiple of some minimum size. You can study all this yourself: I'll upload the sketch along with all the other examples. Just in case, I'll clarify once again: no, I didn't accidentally write a memory leak here, this was done on purpose, to experimentally estimate the harm from fragmentation. The conclusions from this example are as follows. Dynamic memory allocation is a very useful and effective tool that lets you work with objects whose number and sizes are unknown in advance. But on microcontrollers, where the amount of available RAM is very limited, it must be used with extreme care. Depending on the situation, it may be better to allocate some amount of memory for your objects in advance and programmatically refuse to handle more of them. Alternatively, if you use 2-3 classes of objects of different sizes, it may pay to adjust their sizes so that they fit neatly into each other's place; again, this only works if you are sure no third-party library will request memory while you are working with your objects. And in some situations it makes sense to delete all the objects and recreate them from scratch on every change: we detect a change in the list of devices on the line, delete all our objects, and start a full re-scan of the monitored devices.
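The "pre-allocate a maximum and refuse the rest" option can look something like this (a minimal sketch; `Sensor`, `SensorPool` and `MAX_SENSORS` are hypothetical names of mine, not from any real library):

```cpp
#include <cstdint>

// A fixed pool for up to MAX_SENSORS devices: no malloc, no fragmentation,
// at the price of reserving the worst case up front.
constexpr uint8_t MAX_SENSORS = 8;

struct Sensor {
    uint8_t rom[8];       // e.g. a OneWire ROM code
    int16_t last_raw;     // last raw reading
};

struct SensorPool {
    Sensor  slots[MAX_SENSORS];
    uint8_t used = 0;     // bitmask of occupied slots

    Sensor* acquire() {   // grab a free slot, or refuse politely
        for (uint8_t i = 0; i < MAX_SENSORS; ++i)
            if (!(used & (1u << i))) { used |= (uint8_t)(1u << i); return &slots[i]; }
        return nullptr;   // pool exhausted: the 9th sensor is simply rejected
    }
    void release(Sensor* s) { used &= (uint8_t)~(1u << (uint8_t)(s - slots)); }
};
```

The bitmask makes acquire/release O(1) in memory overhead: one byte tracks all eight slots, and a released slot is reused on the very next acquire, so nothing ever fragments.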
Yes, such a re-scan costs processor time, but it eliminates fragmentation (and once again: only as long as memory isn't being allocated for anything else meanwhile). In general, see what suits your situation best. Moving on. Calculating the amount of free memory as the difference between the tops of the Stack and the Heap, as done in this example, is not particularly suitable for real tasks. Because while we can more or less control the sizes of the .data, .bss, and Heap sections, the Stack size can vary greatly as the program runs. Besides procedure calls, and even recursive calls, which grow the stack substantially, interrupts and even nested interrupts will occur during operation, and they all add to the Stack as well. So, to understand whether you still have some reserve of free RAM, you can use a very simple but effective method called memory painting. The essence is this: at the very start of the firmware, we run a procedure that fills RAM with some known value. That's the painting; choose whatever paint color you like. My example is simplified (just to show how it works) and uses the 8-bit value 0xDA as paint, but in general it is better to paint with at least words, or better yet double words. You can, for example, take something from Hexspeak. The main thing is not to choose 0x00 or 0xFF (for obvious reasons). The entire space between the Heap and the Stack is filled. Then, at various points in the program, you call a function that counts how much paint remains. It is advisable to let the firmware run for quite a long time
to capture all the unlucky cases of deep calls, interrupts included. The result lets you evaluate how efficiently memory is used and whether there is room left for further development of your firmware. In my example it turned out that by the time the painting procedure was called, 5 bytes of the Stack were already occupied, partly by that call itself. Calling the checking function took 18 bytes of Stack, i.e. the minimum amount of free memory dropped by 13 bytes. During the initialization of the serial port and a short print to it, the Stack grew to 28 bytes. Printing a floating-point number to the port took 69 bytes, and printing all these statistics took 109. And the biggest stack consumer is, of course, a recursive function call. It's elementary: whatever number is passed to the function is how many times it calls itself. This takes 611 bytes of the stack. And if you wait a while, you can catch an interrupt arriving right at the moment of maximum recursion depth, and the stack grows to as much as 622 bytes. It's very interesting; I recommend experimenting with this example, it gives a good picture of what happens while your firmware is running. By the way, for the ESP there are ready-made API functions that let you evaluate all of this. Well, we have one more useful lifehack left, so let's take a quick look at it and call it a day. You've probably noticed that as soon as you add calculations on the 'float' or 'double' types to your program (by the way, the version of avr-gcc that ships with the Arduino environment has no real 'double' type at all — or rather it exists, but the same 'float' hides underneath), your firmware immediately "gains" 1-2 kilobytes, even though you only seem to have performed a couple of simple operations. That's because in microcontrollers that are not equipped
with a coprocessor for floating-point numbers, all these operations are performed in software, which costs both Flash space for the code and processor time to execute it. Therefore, if you have no complex functions, roots or powers, but only addition, subtraction, multiplication and division (the most demanding of the four), it makes sense to perform all these operations on integers and, where possible, replace division with bit shifts. It's done very simply: every number is shifted some number of bits to the left, and the bits freed up on the right are used to store the fractional part. Everything works exactly as in the decimal system, only there the digits to the left of the point correspond to powers of 10, while in binary they are, of course, powers of 2. And to the right of the point, the digits correspond in the decimal case to negative powers of 10, i.e. 1/10, 1/100, 1/1000 and so on, while here it's 1/2, 1/4, 1/8, 1/16 and so on. The more bits we allocate to the fractional part, the more accurate the calculations. It all depends on the task. For example, the popular DS18B20 digital temperature sensor uses 4 bits for the fractional part, so the smallest nonzero magnitude that can be written in this format is 1/16, or 0.0625. You could say that in this form a number is represented as some count of sixteenths of 1. When performing operations on such numbers, to avoid losing precision, do the multiplication first and the division afterwards, where possible. And, of course, watch that you don't overflow the variable; if necessary, take an integer type of larger width. If you need an integer result, you simply shift the result back, discarding the fractional part, which is equivalent to rounding down. If you want — or rather, if you need — you can add the simplest algorithm for ordinary rounding: add half of the fractional step before shifting.
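Here is what such arithmetic can look like with a DS18B20-style 4-bit fractional part (a minimal sketch; the `q4_*` names are my own, the rounding helper assumes non-negative values, and overflow checking is left to the caller):

```cpp
#include <cstdint>

// Q4 fixed point: the stored integer is the real value times 16,
// i.e. 4 fractional bits, like the DS18B20 temperature format.
typedef int16_t q4_t;

q4_t q4_from_int(int v)      { return (q4_t)(v * 16); }
int  q4_to_int_floor(q4_t v) { return v >> 4; }        // drop the fraction (rounds down)
int  q4_to_int_round(q4_t v) { return (v + 8) >> 4; }  // add 0.5 (= 8/16) first; v >= 0

// Multiply in a wider type first and shift back afterwards, so precision
// is not lost and the intermediate product cannot overflow 16 bits.
q4_t q4_mul(q4_t a, q4_t b) { return (q4_t)(((int32_t)a * b) >> 4); }
q4_t q4_div(q4_t a, q4_t b) { return (q4_t)(((int32_t)a << 4) / b); }
```

For example, 2.5 is stored as 40 and 1.5 as 24; q4_mul(40, 24) gives 60, i.e. 3.75, which rounds up to 4 or truncates down to 3.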
If you need to evaluate an expression that includes real-valued constants, operations with them can very often be replaced by a multiplication and a shift. For example, for microcontrollers powered from 3.3 volts, it is convenient to replace '3.3' with the fraction 845/256: if you need to multiply something by 3.3, multiply it by 845 and then, at the very end, shift everything 8 bits to the right, which is equivalent to dividing by 256. Multiplication by Pi can be represented as multiplication by 201 followed by a 6-bit shift, or, if you need more precision, Pi can be taken as 3217/1024, i.e. a 10-bit shift. And so on for any constant. Division by a constant can likewise be represented as multiplication by an approximation of its reciprocal followed by a shift. Dividing by Pi is, to some precision, equivalent to multiplying by 163 and then dividing by 512, i.e. shifting right by 9 bits. And if you really need more precision, you can multiply by 83443 and then shift by 18 bits; for that you will most likely have to work in a 64-bit data type, but the difference only shows up around the 7th-8th decimal place. Of course, it won't always be possible to get rid of division entirely, but even then integer division is faster and takes less Flash than the software division routines for real types.
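The fractions above are easy to wrap into little helpers and check (again a sketch; the function names are mine):

```cpp
#include <cstdint>

// Multiply by 3.3 as 845/256 (8-bit shift), by Pi as 3217/1024 (10-bit
// shift), and divide by Pi as multiplication by 83443 with an 18-bit shift.
// All results round toward minus infinity for non-negative inputs; pick a
// type wide enough that the intermediate product cannot overflow.
int32_t mul_3v3(int32_t x) { return (x * 845) >> 8; }
int32_t mul_pi (int32_t x) { return (x * 3217) >> 10; }
int64_t div_pi (int64_t x) { return (x * 83443) >> 18; }
```

Feeding in 1000 gives 3300 for mul_3v3 and 3141 for mul_pi, and div_pi(10000) gives 3183 against the exact 3183.09..., so for inputs of this scale the error stays within the last digit.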
I have prepared an example on fixed-point integer calculations: here the voltage applied to a pin is computed from the ADC readings. If we take the same precision as the DS18B20 sensor, there is a small difference from the float version... and if you choose, say, an 8-bit fractional part, the difference becomes minimal. And the win in code size is more than a kilobyte! And that is with the custom printing routine for our fixed-point numbers included; if you don't need it, the savings are even greater. Yes, you lose some computational accuracy this way, but let's be honest: operations on the 'float' and 'double' types are also performed with limited accuracy and are far from ideal, and for most tasks no particularly high accuracy is required anyway. As for what to do when you have serious amounts of computation and the usual accuracy is not enough, we will discuss that in the next video.
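My own reconstruction of such an ADC example might look like this (assumptions: a 10-bit ADC, a 5.0 V reference and an 8-bit fractional part; `adc_to_q8_volts` and `print_q8` are invented names, not the exact sketch from the video):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Q8 volts from a 10-bit ADC with a 5.0 V reference:
// v = adc * 5 / 1024, scaled by 256, i.e. (adc * 5 * 256) / 1024,
// which simplifies to (adc * 5) >> 2 -- no division left at all.
uint16_t adc_to_q8_volts(uint16_t adc) { return (uint16_t)((adc * 5u) >> 2); }

// Print a Q8 value with two decimals without pulling in any float code:
// the fractional byte is rescaled to hundredths with one multiply and shift.
void print_q8(uint16_t q8, char* buf) {
    unsigned frac100 = (unsigned)(((uint32_t)(q8 & 0xFF) * 100) >> 8);
    sprintf(buf, "%u.%02u", (unsigned)(q8 >> 8), frac100);
}
```

A half-scale reading of 512 gives the Q8 value 640, which prints as 2.50; full scale 1023 prints as 4.99 against the true 4.995 V.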
Colleagues, my content has been quite dense lately, which is why it doesn't come out often, so subscribe so as not to miss anything: it will only get more interesting from here. Write comments! What other memory-usage techniques did I forget to mention? Which ones do you use yourself? Like or dislike, subscribe to my other resources, and stay in touch!
2025-01-24 14:07