Implementation OP-11883: Improve performance of baseline filtering

View differences

Updated by Christophe Bliard about 3 years ago

The initial implementation of the filtering for historic data would lead to poor performance on larger production data. The SQL below is an exemplary excerpt of the statements currently issued to get historic data. It would be issued if an administrator wants to compare the current work packages with the state of a month back without changing the default filter ("Status open"):

```SQL
SELECT
'2023-01-06 07:57:14 UTC' as timestamp,
work_package_journals.*,
journals.journable_id as id,
journables.created_at as created_at,
journals.updated_at as updated_at
FROM
( SELECT
~~(SELECT~~ distinct on (journable_id) * FROM "journals"

WHERE "journals"."created_at" <= '2023-01-06 07:57:14.153220'
AND journals.journable_type = 'WorkPackage'

ORDER BY "journals"."journable_id" ASC,
"journals"."created_at" DESC
) ~~DESC)~~ journals,

"work_package_journals",

(SELECT id, created_at FROM "work_packages") AS journables

WHERE
"journables"."id" = "journals"."journable_id"
AND "journals"."data_id" = "work_package_journals"."id"
AND "project_id" IN (
SELECT ~~(SELECT~~ "projects".id
FROM "projects"
INNER JOIN "enabled_modules"
ON "projects"."id" = "enabled_modules"."project_id"
AND "enabled_modules"."name" IN ('work_package_tracking')
AND "projects"."active" = TRUE
WHERE "projects"."active" = TRUE)
AND "status_id" IN (SELECT id FROM statuses WHERE NOT is_closed)
```

On a larger database the statement will take over a minute to execute. The performance improves when applying a limit and ordering by journable\_id but decreases again when using a different column.

Attempts to improve the performance by adding indices failed.

The statement suffers from two performance hotspots:

* Calculating the journals relevant for the filtering (most recent journal before the timestamp queried for) done in the `journals` subselect.
* Joining the `journals` with the other tables: `work_package_journals` and `journables`

Both of the hotspots can be addressed by creating a materialized view that prepares the journals to be read more easily:

```SQL
CREATE MATERIALIZED VIEW
work_package_journals_prepared
AS
SELECT
journals.journable_id as id,
journals.updated_at as updated_at,
journals.active_range,
journals.notes,
journals.end_at,
journals.created_at as created_at,
journables.created_at as first_created_at,
work_package_journals.type_id,
work_package_journals.project_id,
work_package_journals.subject,
work_package_journals.description,
work_package_journals.due_date,
work_package_journals.category_id,
work_package_journals.status_id,
work_package_journals.assigned_to_id,
work_package_journals.priority_id,
work_package_journals.version_id,
work_package_journals.author_id,
work_package_journals.done_ratio,
work_package_journals.estimated_hours,
work_package_journals.start_date,
work_package_journals.parent_id,
work_package_journals.responsible_id,
work_package_journals.budget_id,
work_package_journals.story_points,
work_package_journals.remaining_hours,
work_package_journals.derived_estimated_hours,
work_package_journals.schedule_manually,
work_package_journals.duration,
work_package_journals.ignore_non_working_days
FROM
journals,
"work_package_journals",
(SELECT id, created_at FROM "work_packages") AS journables
WHERE
"journables"."id" = "journals"."journable_id"
AND
"journals"."data_id" = "work_package_journals"."id"
AND
"journals"."data_type" = 'Journal::WorkPackageJournal'
AND
journals.journable_type = 'WorkPackage'
```

Using the excerpt above, the materialized view would only contain the work package journals up to a year ago. That includes journals created within that period but also journals which were replaced within that period. The excerpt makes use of an amendment to the `journals` table `active_range` including the `created_at` date of the journal as its lower bound and the `created_at` date of the journals successor as its upper boundary (not included). If no successor exists, the upper boundary would be NULL. Adding that column would need changes to the journal creating sql but might be beneficial in other scenarios as well. If the column is not added, the SQL for the materialized view would need to be altered creating the view would be possible nonetheless. ~~nonetheless.~~

Indices can be added on that materialized view optimizing performance.

Employing the view, the filtering statement from above can be simplified a lot:

```SQL
SELECT
'2023-01-23 07:57:14 UTC' as timestamp,
wp_journals.*

FROM "work_package_journals_prepared" wp_journals
WHERE wp_journals.active_range @> timestamp with time zone '2023-01-23 07:57:14 UTC'
AND project_id IN (SELECT id FROM "projects" WHERE "projects"."id"= project_id AND "projects"."active" = TRUE)
AND status_id IN (SELECT id from statuses WHERE NOT is_closed)

```

The performance also improves drastically but is still not optimal. It takes about 2 seconds to execute.

If order & limit are applied to statements, the performance of both improve. The initial statement takes about 2.5 s and the one based on the view 100ms.

This improvement comes at a cost as the materialized view needs to be maintained. Refreshing the view on the used dataset takes about 6 minutes (where the `active_range` is already calculated). Because of this, it will not be possible to trigger refreshing the view on every update to the journals/work\_package/work\_package\_journals tables. But this would presumably not be necessary. Instead the update could be done in regular intervals e.g. nightly or twice a day. The baseline comparison feature would only have to raise an error or notification when filtering for a time frame the materialized view does not cover.

Back

Content

View differences