[Scale] ix-applications snapshot management

Description

While the WebUI already offers scheduled automated snapshotting with a retention limit on the Data Protection page, the docker/kubernetes filesystems and snapshots are created and managed by TrueNAS itself as part of the ix-applications dataset and its child datasets. The user has no input into how often these snapshots are created or how many are kept. This leads to thousands of snapshots, the majority of which report 0B used.

This makes it difficult to actually locate one's own user-defined snapshots.

It also degrades the performance of many storage-layer actions, resulting in lengthy start-ups/boot-ups (hours or even days instead of minutes) and slow storage operations on new app deployments (minutes instead of seconds).

I think that we need:

an option to define/enforce a maximum allowed number of snapshots at the dataset level (with the option to apply it recursively to child datasets), so that TrueNAS keeps only a certain number of snapshots and automatically deletes the oldest when a new one is created (aka auto-rotating snapshots)
an option to prohibit the creation of empty snapshots at the dataset level, similar to what is available when defining a snapshot task (though allowing an override for manual snapshots)
an option to see any rules defined by TrueNAS application management - eg. to disallow empty snapshots or limit the number of snapshots created
an option to prune empty/excessive existing snapshots (see the sketch after this list) - even though they don't take up much storage, they create a lot of noise in every snapshot check or listing
an option to squash multiple snapshots into a single one containing all the changes within a certain period of time, to allow automated pruning and better management over time
an automated mechanism to update any references to pruned snapshots so they point to the next existing snapshot instead
an automated mechanism to prune old snapshots after a certain period of time and handle their removal even when they are referenced by historical application definitions.
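
As a rough illustration of the pruning idea, a minimal shell sketch for the 0B-used case could look like the following (the pool name is a placeholder; snapshots with dependent clones or holds would still refuse to be destroyed):

  # dry-run: list ix-applications snapshots that report 0B used and show what destroying them would do
  zfs list -H -p -t snapshot -r -o name,used <pool>/ix-applications \
    | awk '$2 == 0 {print $1}' \
    | while read snap; do
        zfs destroy -nv "$snap"    # drop -n once the dry-run output looks right
      done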

I do remember there were scripts devised for snapshot management/pruning, either for ZFS in general (e.g. zfs-prune-snapshots) or specifically for FreeNAS ZFS on older versions of the TrueNAS platform (e.g. zfs-rollup). A solution similar to those should already have shipped in a production release, with the functionality exposed in the UI.

Otherwise you risk users reporting issues with multiples of alerts such as:

Dataset <pool>/ix-applications/docker/<random_name_similar_to_sum> has more snapshots (<your value>) than recommended (512). Performance or functionality might degrade.

that they cannot do anything about, since at that stage it is difficult to pick snapshots to delete out of the thousands the ix-applications dataset will contain (and many of these may be referenced or otherwise blocked from easy manual destroy actions, further making it difficult for users to detect and remove the "unwanted/old/no-longer-needed" ones). Furthermore, I believe that due to this huge number of ix-applications snapshots, almost all storage-related actions take a lot longer than they used to, and the performance/responsiveness degradation will only get worse if such a feature is not introduced.
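
For reference, a user hitting that alert can at least see which child datasets are the worst offenders with something like the following (pool name is again a placeholder):

  # count snapshots per dataset under ix-applications, worst offenders first
  zfs list -H -t snapshot -r -o name <pool>/ix-applications \
    | cut -d'@' -f1 | sort | uniq -c | sort -rn | head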

Activity


Kris Moore 
July 18, 2024 at 5:55 PM

Thank you for submitting this feature request! To better accommodate and gauge community interest for future versions of TrueNAS we have moved the submission process to our TrueNAS Community Forums. If this feature is still important and relevant for consideration, please refer to the links below on how to submit it for community voting and TrueNAS roadmap review.

Feature Requests Forum:
https://forums.truenas.com/c/features/12

Feature Requests FAQ:
https://forums.truenas.com/t/about-the-feature-requests-category-readme-first/8802

kudos.antelope 
August 22, 2023 at 3:41 PM

This is really annoying, and so far I can't find a use case that justifies them.

Pros

  • None

Cons

  • Bloats and slows down the snapshot page

  • Local backup of ix-applications will quickly bring you over 10 000 snapshots if using many apps

  • Doesn't feel like it complies with ZFS conventions, and doesn't seem right since the apps are from ix-systems

We want

  • A maximum of 1 snapshot per app per day, so 30 apps backed up locally with 30 days of retention would be at most 900 snapshots

  • Ability to restore an individual application snapshot (backup) from the app hamburger menu

  • Ability to restore all apps from a backup, similar to heavyscript but this time officially supported

Thank you!

TrueNAS User 
April 3, 2022 at 9:40 AM
(edited)

I'm a new TrueNAS user and started off with TrueNAS-SCALE-22.02.0.1. I enabled automatic snapshotting today, and when looking into the snapshots it made I saw all the ix-applications docker snapshots. I have not installed any applications on my system yet, but still had more than 100 snapshots. You can remove some of them, but many give an error saying they can't be removed due to a dependent clone. So I vote for a feature that gives users some control over cleaning these up, or a setting to configure how much retention you want on this type of snapshot.

Example error message when deleting a snapshot of ix-applications that has a dependent clone:

Unfortunately, the error message doesn't tell you what the dependent clone is.
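
For what it's worth, the blocking clone can usually be found by querying the snapshot's clones property - the dataset and snapshot names below are just placeholders:

  # show which clone(s) depend on a given ix-applications snapshot
  zfs get -H -o value clones <pool>/ix-applications/docker/<dataset_name>@<snapshot_name>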

Marcin Karwiński 
March 28, 2022 at 8:45 AM
(edited)

FYI, while it's not an official solution: some SCALE users have e.g. suggested running docker image prune once in a while to "manually" clean the system of any non-needed images, and some reported success in improving the situation just from running docker container prune once in a while (though if you want to combine both you'd need to prune containers first and then images, and then e.g. run docker volume prune separately, because the volume domain does not support the until filter). I went with a docker-specific but more combined/thorough solution...
I've used docker system prune (unfortunately, if I want to clean volumes within this command I cannot use the until filter, as the same limitation as for the volume-specific command applies here as well) in a daily cron job (though I think I'll move it to something like a bi-weekly or monthly job after some testing period) - sure, it's a workaround, it's rough around the edges, it doesn't allow combining volume and date filters, and most importantly I think it somewhat "breaks" the applications' instant rollback capability (it seems TrueNAS SCALE's Kubernetes pods connect an app's "previous entries" with their on-filesystem/on-dataset sets of snapshots/clones, which are removed when docker prune drops the previous pod versions), but I can kind of live with that, as I'm usually testing app (re)deployments between 16:00 and 24:00 and have the daily prune run set for 03:00, so there's plenty of time to test and then roll back if needed before the prune runs.
Of course, one could also chain a full stack of docker <domain> prune commands for each subdomain - container, image, volume, network, cache - and use the respective best options/switches to clean more safely and granularly, but I went with the simpler system-wide basic command instead. Whether you prefer the single but limited command or a combo of domain-specific ones is up to you, of course.
In effect:

  • all the stopped/finished containers have been removed/deleted, and app restarts for the remaining active containers/pods have greatly improved in snappiness (back to how it was) - overall, the reported container count dropped from 2k+ to nearly 100...

  • thousands upon thousands of snaps, clones and filesystems/volumes have been removed along with the containers - I'm down from 46.6k to 0.6k snaps, and in storage terms that's nearly 100GB freed...

  • hundreds of hanging/older versions of images have been deleted - I'm down from well over 2k to less than 20 now...

  • network routing to the cluster and through, to and from the respective containers has also improved...

  • the docker build cache takes up dramatically less storage - down from over 3GB used to less than 1GB...

  • my CPU now reports under 10% usage on idle (idle meaning no active compute from running containers), with temps quickly dropping to 40°C at idle - previously, even on idle, the CPU in my server was hovering around 70% usage, with temps around a similar number, though in degrees C...

  • reboots are taking less than 20 mins (from reboot initiation to all apps and services up again, including all the hardware/bios boot stages) where before these took multiple hours...

Overall, I think I'm going to be a somewhat happier user with this primitive workaround until a proper, smarter approach is offered in TrueNAS SCALE. I'm thinking of something like an option to mark a container/pod as test, dev or prod - e.g. keep the containers and snapshots for debug analysis, or have them pruned on container failure or after a few hours/daily for already tested and presumably working PROD ones - or some ephemeral option to have pods run the container and then automatically clean up everything docker created afterwards. Docker support's position is basically that this is by-design behaviour, meant to enable analysis of docker volumes/layers on failing/broken containers (which I honestly did and still do use a lot when building/testing my own docker images against my own docker repository), with any maintenance to be organised elsewhere. The TrueNAS team's position currently remains that this is a docker dogmatic issue - and they are right about that; it's how docker's developers designed it, more in line with apps started/stopped ad hoc, quickly and often, then analysed and cleaned up manually when needed, than with apps running for long periods without further analysis/debugging by the users - so docker does not properly clean up after itself in the long run. I think this will do for now, until a better solution (i.e. periodic housekeeping of any leftover docker/kubernetes trash) is devised/offered in some future version of TrueNAS.
I've also replaced all the aliases for docker run with docker run --rm for any stable docker image/container for my users (to reduce the count of stopped/finished containers hanging around in docker listings, and to reduce the chances of my users generating noise/trash from impromptu/ad hoc container runs), and left the regular docker run command, which does not clean up after itself on failure, for the very small subset of build/test deployments kept for debug purposes.
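
For example, the swapped-in alias is nothing more than this (the alias name here is just illustrative):

  # in the users' shell profile: containers started this way remove themselves on exit
  alias drun='docker run --rm'
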
Hopefully my set of workarounds will help others.
Please bear in mind that this workaround clears everything docker created for containers that are not currently running, so if you have a stopped app that you only start now and then, you need to have it running when the prune command analyses the containers; otherwise the containers/images/snaps/volumes/networks docker created for it will get purged. I currently have only 2 docker images (and corresponding pods) that I build from scratch/source in my CI/CD pipeline for one-time, short-running docker apps that are stopped most of the time; all other apps are constantly running/in use, so this solution works for me. But your mileage may vary...
Perhaps in the future the TrueNAS SCALE team can devise a one-click clean-up using a combo of all the domain-specific solutions, with further masking of e.g. the current/selected apps/pods/containers from the docker prune invocations.

EDIT:
Just FYI, this is not a complete solution: the result is still about 500 snaps remaining, some of which are the ones from the Data Protection feature - i.e. the automated daily snaps, which are cleaned up automatically and correctly.
There are still droves of snaps taken seemingly upon reboot/kubernetes restart. Some are non-removable, with a "snapshot has dependent clones" message returned when attempting to zfs destroy them by hand (these are on <pool>/ix-applications/<docker>/<dockerimagelayerfilesystem>), and many others are removable (these are on the <pool>/ix-applications/release/<appname>/<numberedversion> child datasets); both types are snaps taken by docker/kubernetes for the apps in the environment. The latter refer to specific (historical) versions of the app deployments on the machine/server (I've had more than 50 release snaps for some of the deployed apps) - they are not in use per se, but are the "recipes/snaps" for specific application deployment rollbacks, as used by the Installed Applications -> App -> Rollback functionality, if I recall/understand correctly. I'm not sure which docker/kubernetes mechanism manages this, but over time even a handful of running apps will grow these 2 types of snaps to over a hundred, or easily over 500. Sure, that's still a reduction thanks to the daily docker prune runs compared to the multiple thousands before, but these are hundreds of snaps not taken by user-managed/deployed automation, nor are they related to or pruned by the aforementioned cron-job commands; instead they are snaps taken for the Kubernetes/docker environment by those application tools.
Perhaps in the future the TrueNAS SCALE dev team will offer some intelligent cleaning utility to take care of these automatically in a smart manner, or give an option to limit the creation of these in the app deployment forms. Sure, they're not taking up a lot of space on the storage per se, thanks to the snapshotting nature, but these are still hundreds of snapshots that should be better managed by the system.
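
If you want to see how those per-app release snaps accumulate over time, listing them oldest-first makes it obvious - the pool and app names are placeholders, and the dataset path should be adjusted to whatever your system actually uses:

  # list the rollback snaps kept for one app, oldest first
  zfs list -t snapshot -r -o name,creation -s creation <pool>/ix-applications/release/<appname>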

Dran Ron 
March 23, 2022 at 7:03 PM
(edited)

I have the same issue.

I migrated from TrueNAS CORE to SCALE and have had 2 apps running for 2 weeks. There are already 149 snapshots of the ix-applications dataset with random names. I did not find any documentation regarding when/how these snapshots are created, but it seems that they are made at each boot.

A way to control them would be great.

Also attaching the output of the following command:

Status: Unresolved

Created March 20, 2022 at 11:00 AM
Updated July 18, 2024 at 6:01 PM