Updated by Marcello Rocha Pereira over 2 years ago
A problem we need to tackle:
\- Many job failure exceptions create a lot of noise. For instance, when the Application password is misconfigured, the job fails with an unauthorized message. Application developers cannot always resolve these exceptions (if the Nextcloud Application password becomes stale, the instance admins must be notified automatically), but the exceptions must not be ignored if OpenProject is to operate correctly with enabled storages.
Possible steps to solve the problem:
* [x] Do not fail the job on a defined set of errors/cases. Instead, report them to the logs, mark the storage as unhealthy, and start the process of notifying admins if needed.
  * [ ] Unauthorized error. The most probable cause is a wrong or stale Application password, which an instance admin should fix.
  - ...
\- Introduce a healthy/unhealthy property for storages.
- It can be a set of columns on the storage that answer at least the following questions:
- Is the storage healthy?
- If not healthy, then since when?
- If not healthy, then what is the problem?
- If not healthy, then have admins been notified already?
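As a rough illustration of the four questions above, the health state could be modeled as follows. This is only a sketch in Python for brevity; the class name `StorageHealth` and all field names are hypothetical and do not describe existing OpenProject columns.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class StorageHealth:
    """Hypothetical health columns for one storage (all names are assumptions)."""

    healthy: bool = True                            # Is the storage healthy?
    unhealthy_since: Optional[datetime] = None      # If not, since when?
    reason: Optional[str] = None                    # If not, what is the problem?
    admins_notified_at: Optional[datetime] = None   # Have admins been notified?

    def mark_unhealthy(self, reason: str, now: datetime) -> None:
        # Only a healthy -> unhealthy transition resets the timestamps,
        # so repeated failures keep the original "since when" value.
        if self.healthy:
            self.unhealthy_since = now
            self.admins_notified_at = None
        self.healthy = False
        self.reason = reason

    def mark_healthy(self) -> None:
        # Clear everything so the next failure starts from a clean state.
        self.healthy = True
        self.unhealthy_since = None
        self.reason = None
        self.admins_notified_at = None
```

Keeping `unhealthy_since` stable across repeated failures is a design choice made here so that "since when" always points at the first failure.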
\- If the job misbehaves, set the state to unhealthy and notify the admin(s).
\- If something goes wrong with the job:
- It is an expected issue that admins can resolve => we mark the storage as unhealthy and notify admins regularly, with probable causes and steps to fix it, until the storage is healthy again.
- It is an unexpected issue, whether or not admins can resolve it => we fail the job with an exception, mark the storage as unhealthy, and still notify admins about the problem.
\- When does an unhealthy storage become healthy again?
- When the storage synchronization passes without unexpected behavior, we mark it as healthy again.
- Do we need to notify admins about recovered storages?
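The expected/unexpected distinction and the recovery rule could be sketched like this. It is a minimal Python illustration, not the actual job code: `ExpectedSyncError`, `run_sync`, and the dictionary keys are hypothetical names, and the notification step is left out.

```python
import logging

logger = logging.getLogger("storage_sync")


class ExpectedSyncError(Exception):
    """Hypothetical: an issue admins can resolve, e.g. an unauthorized response."""


def run_sync(storage: dict, sync) -> None:
    """Run one synchronization and update the (assumed) health fields."""
    try:
        sync()
    except ExpectedSyncError as error:
        # Expected issue: log it and mark the storage, but do not fail the job.
        logger.warning("storage sync failed: %s", error)
        storage["healthy"] = False
        storage["reason"] = str(error)
    except Exception as error:
        # Unexpected issue: mark the storage and let the job fail.
        storage["healthy"] = False
        storage["reason"] = f"unexpected: {error}"
        raise
    else:
        # A clean pass marks the storage as healthy again.
        storage["healthy"] = True
        storage["reason"] = None
```

In both failure branches the storage ends up unhealthy; the only difference is whether the exception propagates and fails the job.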
\- How to notify admins about unhealthy storages?
- Through email.
- Not on every synchronization failure, but on a schedule. For instance: on the first transition to the unhealthy state, and then once per day.
- One possible option is a separate cron job that notifies once per day about every unhealthy storage.
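The schedule described above (notify on the first transition, then once per day) could be decided by a small check like the one below. This is a sketch in Python; the function name `should_notify`, its parameters, and the one-day default are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_notify(
    healthy: bool,
    last_notified_at: Optional[datetime],
    now: datetime,
    interval: timedelta = timedelta(days=1),  # assumed default: once per day
) -> bool:
    """Decide whether admins should receive an email for this storage."""
    if healthy:
        return False
    if last_notified_at is None:
        # First transition to the unhealthy state: notify immediately.
        return True
    # Afterwards, notify again only once the interval has elapsed.
    return now - last_notified_at >= interval
```

A daily cron job could call this check for every storage and send one email per storage for which it returns true.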