Click her to see the next.
Solve PostgreSQL DataFileRead and buffer_io with Parameter Tuning | by monika yadav | Medium
This is an epic case study to solve DataFileRead and buffer_io without changing disk hardware, iops and without creating any new index!
Problem Walk Through:
You can see from the image attached above that our average active sessions(AAS) were going upto 60 which is a very big number. An active session is a database connection that has submitted a request to the database but has not yet received the response. It means that in our case on an average 30 to 60 sessions were waiting for the response from the database server.
In Performance Insight, in database load chart legend in the top right corner there is an option to select chart type (bar / line) and slice by(application / database / host / session_type / user /sql / waits). We have sliced the chart by waits with bar type which means that the bar line will be showing % of time spent by different sessions in particular postgreSQL wait events.
In our case major wait events as can been seen were:
1) DataFileRead: This event occurs when a connection waits on a backend process to read a required page from storage because the page isn’t available in shared memory.
2) buffer_io: This event occurs when lots of connections wait for pages to be read into the shared buffers pool. As soon as pages are read into the shared buffer pool the wait event is released.
If we combine both issues it means our shared buffer isn’t sufficient for our current workload and leading to buffer_io wait and DataFileRead to read pages into shared buffers. We picked the queries that were showing 95% of wait on both events.
Possible Demons:
The possible demons to lead us into current situations could be one or multiple of below:
- Queries doing sequential scans of tables and causing havoc on the database. But in our case the problematic query wasn’t doing the sequential scans.
- Queries might be using index but index definition isn’t the best fit. In our case the index definition and usage was good.
- Shared_buffer isn’t sufficient and PostgreSQL needs more shared buffers or may be more overall memory allocation.
- Heavy write operations/ongoing index activity/vacuum activity which would have slowed down. Fortunately this was also not the case with us.
- High number of read operations which would have aggravated the situation. This was also not the case with us.
Problem Root Cause:
So what was the root cause in our case? What do you think?
?
?
?
It was smaller shared_buffers.
Possible Solutions:
- Increasing shared buffers size
- Optimising shared buffers content.
Shared buffers size can be increased by:
- Increasing shared_buffers parameter value involves downtime
- Upscaling RDS instance size involves both downtime and increased cost.
I started looking into optimising shared buffers.
I picked and analyzed the query with highest no of wait on buffer_io and DataFileRead. Below is explain plan of query.
You will see 99.98% of time spent by query is in index scan. Also actual time=43.760..6346.829 means that it took 43 ms before the index found the first result. Also shared buffer hit ratio is only 20%, formula to calculate shared buffer hit ratio is : (shared hit*100)/(shared hit+ shared read).I suspected index bloating as the issue here.
After analysing the index bloat of the indexes associated with problematic queries, I figured out that our indexes were 50% bloated. I have added references at the bottom of page to find bloating percentage in your indexes.
I first reindexed the indexes used in our problematic query and it improved the performance. I also reindexed the other bloated indexes and after that here is query execution plan:
Permanent Fix:
High index bloat is caused by insufficient autovacuum setting. It means that volume of our write operations were very high and autovacuum wasn’t running frequently enough to avoid increased bloating. To fix our issue permanently, I made autovacuum settings more aggressive.
Default autovacuum settings were:
- Autovacuum_vacuum_scale_factor: 0.1
- autovacuum_analyze_scale_factor: 0.05
- autovacuum_vacuum_cost_delay: 20ms
New customised autovacuum settings are:
- autovacuum_analyze_scale_factor = 0.005
- autovacuum_vacuum_scale_factor=0.01
- autovacuum_vacuum_cost_delay: 30ms
How to change autovacuum settings:
Above settings can be changed system wide from associated parameter group in RDS and postgreSQL configuration file in on-prem / vm installation.
I wanted to change this setting only specific to a table so I went ahead with alter table command:
Set lock_timeout=’300ms’;
ALTER TABLE table_name SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_analyze_scale_factor = 0.005,autovacuum_vacuum_cost_delay=’30ms’);
By fixing bloating and tuning autovacuum settings, we reduced the wait event count from 30–60 to less than 2,
I will be making a very detailed blog post on tuning autovacuum settings for specific needs in my next blog post, till then Happy Learning!
What Problems We Solved:
- Improved application performance by 97%.
- Saved business cost by $100 per month.
- It was done with zero downtime, we saved business downtime.
References: