Implementation OP-11883: Improve performance of baseline filtering

View differences

Updated by Jens Ulferts almost 3 years ago

**Acceptance criteria** The initial implementation of the filtering for historic data would lead to poor performance on larger production data. The SQL below is an exemplary excerpt of the statements currently issued to get historic data. It would be issued if an administrator wants to compare the current work packages with the state of a month back without changing the default filter ("Status open"):

```SQL
SELECT
'2023-01-06 07:57:14 UTC' as timestamp,
work_package_journals.*,
journals.journable_id as id,
journables.created_at as created_at,
journals.updated_at as updated_at
FROM
( SELECT distinct on (journable_id) * Add FROM "journals"
WHERE "journals"."created_at" <= '2023-01-06 07:57:14.153220'
AND journals.journable_type = 'WorkPackage'
ORDER BY "journals"."journable_id" ASC,
"journals"."created_at" DESC
) journals,
"work_package_journals",
(SELECT id, created_at FROM "work_packages") AS journables
WHERE
"journables"."id" = "journals"."journable_id"
AND "journals"."data_id" = "work_package_journals"."id"
AND "project_id" IN (
SELECT "projects".id
FROM "projects"
INNER JOIN "enabled_modules"
ON "projects"."id" = "enabled_modules"."project_id"
AND "enabled_modules"."name" IN ('work_package_tracking')
AND "projects"."active" = TRUE
WHERE "projects"."active" = TRUE)
AND "status_id" IN (SELECT id FROM statuses WHERE NOT is_closed)
```

On a `tstzrange` column ~~larger database the statement will take over a minute~~ to execute. The performance improves when applying a limit and ordering by journable\_id but decreases again when using a different column.

Attempts to improve the performance by adding indices failed.

The statement suffers from two performance hotspots:

* Calculating the journals table.
~~relevant for the filtering (most recent journal before the timestamp queried for) done in the `journals` subselect.~~
* The lower bound Joining the `journals` with the other tables: `work_package_journals` and `journables`

Both of ~~the hotspots can be addressed by creating a materialized view~~ that column is ~~prepares~~ the created\_at value journals to be read more easily:

```SQL
CREATE MATERIALIZED VIEW
work_package_journals_prepared
AS
SELECT
journals.journable_id as id,
journals.updated_at as updated_at,
journals.active_range,
journals.notes,
journals.end_at,
journals.created_at as created_at,
journables.created_at as first_created_at,
work_package_journals.type_id,
work_package_journals.project_id,
work_package_journals.subject,
work_package_journals.description,
work_package_journals.due_date,
work_package_journals.category_id,
work_package_journals.status_id,
work_package_journals.assigned_to_id,
work_package_journals.priority_id,
work_package_journals.version_id,
work_package_journals.author_id,
work_package_journals.done_ratio,
work_package_journals.estimated_hours,
work_package_journals.start_date,
work_package_journals.parent_id,
work_package_journals.responsible_id,
work_package_journals.budget_id,
work_package_journals.story_points,
work_package_journals.remaining_hours,
work_package_journals.derived_estimated_hours,
work_package_journals.schedule_manually,
work_package_journals.duration,
work_package_journals.ignore_non_working_days
FROM
journals,
"work_package_journals",
(SELECT id, created_at FROM "work_packages") AS journables
WHERE
"journables"."id" = "journals"."journable_id"
AND
"journals"."data_id" = "work_package_journals"."id"
AND
"journals"."data_type" = 'Journal::WorkPackageJournal'
AND
journals.journable_type = 'WorkPackage'
```

Using the excerpt above, the materialized view would only contain the work package journals up to a year ago. That includes journals created within that period but also journals which were replaced within that period. The excerpt makes use of ~~an amendment to~~ the ~~`journals` table `active_range` including the `created_at` date of the~~ journal
* The upper ~~as its lower~~ bound of that colum is ~~and~~ the created\_at value ~~`created_at` date~~ of the journal's successor.
* That value can thus only be written once the ~~journals~~ successor is created.
* ~~as its upper boundary (not included).~~ If there is no successor (yet), ~~exists,~~ the upper bound will ~~boundary would~~ be empty.
* Adapt ~~NULL. Adding that column would need changes to~~ the timestamp filtering ~~journal creating sql but might be beneficial in other scenarios as well. If the column is not added, the SQL for the materialized view would need~~ to use ~~be altered creating~~ the `tstzrange`
* (If possible) Adapt view would be possible nonetheless.

Indices can be added on that materialized view optimizing performance.

Employing the work package query results to no longer have ~~view, the filtering statement from above can be simplified~~ a whole subquery lot:

```SQL
SELECT
'2023-01-23 07:57:14 UTC' as timestamp,
wp_journals.*

FROM "work_package_journals_prepared" wp_journals
WHERE wp_journals.active_range @> timestamp with all filter criteria time zone '2023-01-23 07:57:14 UTC'
AND project_id IN (SELECT id FROM "projects" WHERE "projects"."id"= project_id AND "projects"."active" = TRUE)
AND status_id IN (SELECT id from statuses WHERE NOT is_closed)
```

The performance also improves drastically but rather have only is still not optimal. It takes about 2 seconds to execute.

If order & limit are applied to statements, the performance of both improve. The initial statement takes about 2.5 s and the one subquery comparing based on the view 100ms.

This improvement comes at all ~~a cost as~~ the relevant timestamps e.g `"journals"."validity_period" **@>** timestamp **with** time **zone** '2020-05-12 04:00:00' **OR** "journals".validity_period **@>** timestamp **with** time **zone** '2023-06-07 08:52:05.515455'`
* Whether this ~~materialized view needs to be maintained. Refreshing the view on the used dataset takes about 6 minutes (where the `active_range`~~ is ~~already calculated). Because of this, it will not be~~ possible will need to be double checked and depends ~~trigger refreshing the view~~ on how ~~every update to~~ the individual filters are written. journals/work\_package/work\_package\_journals tables. But this would presumably not be necessary. Instead the update could be done in regular intervals e.g. nightly or twice a day. The baseline comparison feature would only have to raise an error or notification when filtering for a time frame the materialized view does not cover.

Back

Content

View differences