Random invalid session and inconsistent service accounts #19510

pschichtel · 2024-04-15T17:03:21Z

This is a follow up to #19217 and #19201

After my vacation I just verified the state of the minio installation again after the previous issues.

Expected Behavior

Once logged in I'd expect not to randomly receive "invalid session" warnings or to get randomly logged out when navigating to certain pages (e.g. the Site Replication config page).

I would also expect to the same service accounts on my root user every time I refresh the Access Keys page (or when directly accessing /api/v1/service-accounts).

Current Behavior

I randomly get invalid session responses ("The Access Key Id you provided does not exist in our records.") from the backend and on some pages, that leads to a redirect to the login page.

I also get a different list service accounts every time I refresh, sometimes it doesn't even include the site-replicator-0 account, which would explain why I'm still seeing #19217. Actually in my tests now by refreshing /api/v1/service-accounts a bunch of times, I rarely get all 4 service accounts.

The backup site still occasionally logs this as in #19217:

minio-1  | API: SRPeerBucketOps(bucket=154a22a1-8dca-4e64-98d8-687376a04d32)
minio-1  | Time: 16:58:03 UTC 04/15/2024
minio-1  | DeploymentID: bc54da3b-88f4-4a0d-a9d4-2365bf5a0d80
minio-1  | RequestID: 17C68294B9A6D50A
minio-1  | RemoteHost: 
minio-1  | Host: 
minio-1  | UserAgent: MinIO (linux; amd64) madmin-go/2.0.0
minio-1  | Error: Site replication error(s): 
minio-1  | 'ConfigureReplication' on site Production (bf3123cf-9753-4ea4-a46f-535599899c4c): failed(Backup->Production: Bucket target creation error: Remote service endpoint offline, target bucket: 154a22a1-8dca-4e64-98d8-687376a04d32 or remote service credentials: site-replicator-0 invalid 
minio-1  | 	The Access Key Id you provided does not exist in our records.) (*errors.errorString)
minio-1  |        4: internal/logger/logger.go:259:logger.LogIf()
minio-1  |        3: cmd/logging.go:30:cmd.adminLogIf()
minio-1  |        2: cmd/admin-handlers-site-replication.go:142:cmd.adminAPIHandlers.SRPeerBucketOps()
minio-1  |        1: net/http/server.go:2136:http.HandlerFunc.ServeHTTP()

Steps to Reproduce (for bugs)

I'm still not sure how I arrived at this state, I assume by enabling site replication.

I've checked that KES is working on both the production and the backup site. At this point I'm not even able to disable site replication on the production site, because I get constantly logged out (redirected to login page) from the page.

The single-node backup instance does not observe this behavior. there, I never get invalid session responses, I always get the same 4 service accounts on the root user (including site-replicator-0) and I can also access the Site Replication page.

Context

It makes using the minio console difficult. I assume, replication from backup to production would not reliably work (or be a lot slower), but that's not currently something I need to do.

Interestingly mcli admin user svcacct list production admin always returns the complete list of service accounts for my root user, although not always in the same order, but that doesn't matter. S3 clients in general don't seem to be affected, at least not functionally.

To elaborate on the setup:

2 sites:

site (production): 5 nodes, each with 1 disk, deployed via minio-operator to k8s, kes configured against a vault running in the same k8s
site (backup): 1 node with 1 disk, deployed via docker-compose, kes configured with filesystem, containing the necessary keys from vault (to decouple the backup site from the k8s).

The keys between the KES deployments are identical (replicated files from production site can be decrypted on backup site. The production KES setup is responsive and can successfully access the vault (I created and deleted a test key to confirm).

Your Environment

Version used (minio --version): RELEASE.2024-04-06T05-26-02Z
Server setup and configuration: deployed by operator (5.0.14), replicating to a single-node setup on the same version deployed with docker-compose.
Operating System and version (uname -a): Linux 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux

The text was updated successfully, but these errors were encountered:

jiuker · 2024-04-16T08:12:42Z

@pschichtel Do you only have one KES server for production?

pschichtel · 2024-04-16T08:13:34Z

@jiuker production has 3, backup has 1

jiuker · 2024-04-16T08:16:24Z

@jiuker production has 3, backup has 1

Does the 3 KES server have the same keys for production? @pschichtel

pschichtel · 2024-04-16T08:18:27Z

They are all connected to the same vault (with a dedicated V2 KV engine for minio), so I'd assume so. How can I check?

jiuker · 2024-04-16T08:20:28Z

Could you check the key site-replicator-0's value have changed always? @pschichtel

pschichtel · 2024-04-16T08:21:59Z

Not sure what you mean

jiuker · 2024-04-16T08:23:53Z

Not sure what you mean

Check for overlapping value assignments between two clients

pschichtel · 2024-04-16T08:35:51Z

Sorry for being confused!

Could you check the key site-replicator-0's value have changed always?

By key, do you mean a KES key or an access key/secret access key? There is no "site-replicator-0" KES key, so I assume access key. What do you mean by "value" then?

Check for overlapping value assignments between two clients

What do you mean by "value assignments"? And what clients?

I just checked with mcli again (mcli admin user svcacct info production site-replicator-0) and every now and then I get mcli: <ERROR> Unable to get information of the specified service account. The specified service account is not found (Specified service account does not exist)., so there must be an instance that doesn't have the account I guess. Didn't notice it yesterday weirdly. If it doesn't fail with said error, then it returns this consistently:

AccessKey: site-replicator-0
ParentUser: root-user
Status: on
Name: 
Description: 
Policy: implied
Expiration: no-expiry

jiuker · 2024-04-16T11:03:20Z

I only found a strange case where when I open two minio login pages in one browser at the same time, there have one said that {"detailedMessage":"Access Denied.","message":"invalid session"} , but I can access it normally with two browsers. I don't know if it's similar to your case? @pschichtel

pschichtel · 2024-04-16T12:07:33Z

@jiuker I have the occasional case, where the page after login stays blank, because /api/v1/session returns invalid session / The Access Key Id you provided does not exist in our records.. I don't need a two tabs/browsers for that.

I think you case sounds like a race condition on the shared cookies/localStorage/sessionStorage between the browser tabs, that are not shared between browsers.

jiuker · 2024-04-16T14:00:43Z

@jiuker I have the occasional case, where the page after login stays blank, because /api/v1/session returns invalid session / The Access Key Id you provided does not exist in our records.. I don't need a two tabs/browsers for that.

I think you case sounds like a race condition on the shared cookies/localStorage/sessionStorage between the browser tabs, that are not shared between browsers.

Yeah. Will return back to login page for /api/v1/session return that {"detailedMessage":"Access Denied.","message":"invalid session"} when fresh the page. It looks like your case. So I guess that could be a console issue. What's the page you view when that happen. @pschichtel

pschichtel · 2024-04-16T14:04:31Z

@jiuker I don't think it is limited to a specific page, I've seen it happen on several different pages.

So I guess that could be a console issue.

I'm not so sure anymore, because I get errors with mcli too, that doesn't go through the console, right?

harshavardhana · 2024-04-26T20:00:03Z

We can't reproduce any of the issues reported here.

pschichtel · 2024-04-26T20:02:10Z

How can I properly clear replication settings from both sites? then I could test the production cluster without site replication and see if that helps.

poornas · 2024-04-26T20:04:45Z

 $ mc admin replicate remove sitea siteb  --force

pschichtel · 2024-04-26T20:06:54Z

I just noticed, even the replication rules on buckets are completely inconsistent from refresh to refresh.

@poornas thanks, I'll try that next week.

pschichtel · 2024-04-26T20:12:48Z

Bucket versioning is also affected. it seems like everything somehow related to site replication is completely inconsistent between the nodes of the production cluster. It also seems to have gotten worse since I last checked last week.

pschichtel · 2024-04-29T10:23:06Z

@poornas I removed the backup site from replication and it's all fine now. Should the site-replicator-0 account disappear? Or should I clean that account up before re-enabling replication?

jiuker · 2024-04-29T10:28:02Z

Remove site-replication and site-replicator-0 account disappear. You cannot remove that account, it's interval account.

pschichtel · 2024-04-29T10:32:13Z

Are you saying it should automatically disappear after removing site-replication? Because it hasn't so far, neither in the production site nor in the backup site. Both sites don't have any other replication rules.

So I'll delete the service accounts to have a clean state.

jiuker · 2024-04-29T10:41:10Z

Are you saying it should automatically disappear after removing site-replication? Because it hasn't so far, neither in the production site nor in the backup site. Both sites don't have any other replication rules.

So I'll delete the service accounts to have a clean state.

Yeah. It should disappear. If not, you can try delete it for I didn't reproduce your case.

pschichtel · 2024-04-29T10:44:21Z

I removed the accounts, I'll upgrade both instances to latest now and then setup replication again in the evening.

pschichtel · 2024-04-29T11:54:43Z

I remember @harshavardhana saying something about this in a past issue: The /v1/service-accounts endpoint is rather slow (400-900 ms "wait" time in the browser) given that this is a small cluster (5 nodes) and only 3 service accounts exist and my connection is basically local, this feels noticeable slow in the UI. This is still the case even after disabling replication. Is the timing within a normal range or would this be worth investigating? I originally thought this is caused by the replication problem, but apparently it isn't.

pschichtel · 2024-05-06T16:40:35Z

@harshavardhana now after the upgrade to RELEASE.2024-05-01T01-11-10Z I once again have the issue that service accounts seem to be inconsistent between nodes (/api/v1/service-accounts reporting different sets). However this time I don't have any "invalid session" issues so far. At least the site-replicator-0 service account is consistently contained in the response of the service accounts api and so far the backup site is not complaining either.

I wonder if this is something caused by the upgrade process of the operator?

Should I open a new issue for this?

pschichtel · 2024-05-06T16:47:09Z

At least it seems to be limited to the console API, running mcli admin user svcacct list production root continuously in a loop hasn't shown any differences except for order.

pschichtel · 2024-05-10T15:05:49Z

@harshavardhana after upgrading to RELEASE.2024-05-10T01-41-38Z today the console is once again barely usable due to random invalid session and inconsistent service accounts. S3 seems unaffected so far (site replication to backup is unaffected and mcli admin user svcacct list production root has been showing consistent results for a while now.)

I'm now convinced that that the upgrade as performed by the operator seems to cause or at least worsen this issue.

Should I create a new issue? possibly over at minio/console ?

harshavardhana · 2024-05-10T15:13:24Z

This generally I don't it to occur unless someone actively wipes your credentials.

harshavardhana · 2024-05-10T15:14:20Z

Can you collect both sites all their backend .minio.sys folder and share it with us ?

pschichtel · 2024-05-10T17:19:57Z

@harshavardhana Seems there is quite a bit of information in there that I don't think I can just share. Is it possible to limit the requested files? Otherwise I'd first have to clear internally if it's ok to share this stuff.

What I noticed while poking around:

deleting a service account via minio client works and is correctly reflected in the /service-account response (in those cases that don't fail with "invalid session")
it doesn't matter which of the nodes receive the request, they all randomly fail with invalid session. (I assume they internally redistribute the requests?)
the service account files in .minio.sys seem reasonably similar
none of the service accounts have expiry dates, but in the /service-accounts endpoint some say "expiration": "0001-01-01T00:00:00Z" and others say "expiration": "1970-01-01T00:00:00Z". not sure what that is about.
the /service-accounts endpoint is currently all or nothing: either it shows invalid session or it shows all service accounts. I'd say ~2/3 of the time it fails with invalid session.

pschichtel · 2024-05-10T18:14:32Z

Ha... I found the offender. I slowly, one-by-one went through the pods (from last to first similar to the sts controller), deleted them and let the sts controller recreate them. between each pod I repeatedly checked the the /service-accounts endpoint. pods 4, 3 and 2 did nothing, restarting the pod 1 completely resolved the issue.

harshavardhana · 2024-05-10T18:50:28Z

Ha... I found the offender. I slowly, one-by-one went through the pods (from last to first similar to the sts controller), deleted them and let the sts controller recreate them. between each pod I repeatedly checked the the /service-accounts endpoint. pods 4, 3 and 2 did nothing, restarting the pod 1 completely resolved the issue.

Do you have logs from this pod before deleting it ?

pschichtel · 2024-05-10T19:23:15Z

I do, but I don't think there was anything of interest. I'll check...

pschichtel · 2024-05-10T20:38:48Z

here you go: production-1-logs.txt

I noticed that the cluster once lost the quorum. The log file btw includes both the update and my restart the fixed the problem.

pschichtel · 2024-05-29T20:51:24Z

Minor update on this: Apparently this happened again at some point in the last 3 weeks. Sadly this time I don't know which node was affected, but the upgrade to yesterdays's release resolved the issue for now (probably because all nodes got restarted). I'll keep monitoring.

pschichtel · 2024-06-06T17:48:36Z

I once again had this after the upgrade to the newest release. Restarting the nodes one by one once again solved it.

harshavardhana · 2024-06-10T22:46:32Z

I suspect this is due to #19905 and has been around since 03-28 release

pschichtel · 2024-06-11T07:39:45Z

Timing-wise that sounds reasonable. Will the fix help immediately with the next upgrade or will it require me to cycle the deployment one more time?

pschichtel · 2024-06-11T08:08:52Z

Just upgraded to RELEASE.2024-06-11T03-13-30Z and immediately after the deployment has converged I was practically unable to login until I cycled the 4th node. after cycling additional nodes afterwards the situation got worse for a moment for now I once again have a stable with no failing requests in the console. I'll report back to you if it stays this way.

klauspost · 2024-06-11T09:50:59Z

@pschichtel Don't do rolling upgrades.

pschichtel · 2024-06-11T10:02:37Z

@klauspost how am I supposed to do upgrades with the operator?

pschichtel · 2024-06-11T15:40:23Z

Is there any documentation on the topic of upgrades in kubernetes with the operator at all? The operator seems to just do a StatefulSet update which in turn triggers a rolling upgrade last to first (that's k8s' default strategy iirc). The main documentation on upgrades doesn't seem appliable to a) containerized deployments and b) kubernetes deployments, given that overwriting a binary in a container isn't really useful, when the overwrite is lost after restarting the process.

Also does this apply only to rolling upgrades or to rolling restarts in general ? because I'm pretty sure I triggered this issue with just a rolling restart on the exact same version.

harshavardhana · 2024-06-11T15:47:33Z

Operator upgrades first the binary inside the container via pushing the binary directly to the pods, once the pod is restarted() it proceeds to updating the container image which does rolling.

If the first step fails, I suspect the operator is still doing a statefulset update.

harshavardhana · 2024-06-11T15:47:45Z

So your operator logs will have some information regarding this.

pschichtel · 2024-06-11T15:50:50Z

I'll can check that

pschichtel · 2024-06-12T08:44:19Z

I've seen this line both in this production multi-node setup as well as in a homelab deployment also using the operator:

E0611 ...       1 main-controller.go:1500] error syncing 'minio/tenant': Tar file extraction failed for file index: 2, with: EOF

harshavardhana · 2024-06-12T08:46:44Z

I've seen this line both in this production multi-node setup as well as in a homelab deployment also using the operator:
E0611 ...       1 main-controller.go:1500] error syncing 'minio/tenant': Tar file extraction failed for file index: 2, with: EOF

yeah this means upgrade would have failed.

harshavardhana · 2024-06-12T08:46:56Z

this is the exact reason why you face this problem.

pschichtel · 2024-06-12T10:05:02Z

I'd argue the operator would better stop and enter an error state, when a correct upgrade is not possible.

How can we find the cause of the Tar file extraction failed for file index: 2, with: EOF error? As I said: It happens in completely independent setups with different networks and different host systems (different hardware, different kernel, different distro), however they are both in k0s and they both use a similarly configured tenant created through the helm chart, the values:

tenant:
  image:
    repository: quay.io/minio/minio
    tag: 'RELEASE.2024-06-11T03-13-30Z'
  name: tenant
  configuration:
    name: credentials
  pools:
  - name: main
    servers: 5
    volumesPerServer: 1
    storageClassName: ''
    size: 123123123
    labels:
      velero.io/exclude-from-backup: "true"
  metrics:
    enabled: true
  certificate:
    requestAutoCert: true
  env:
  - name: MINIO_OPERATOR_TLS_ENABLE
    value: "off"
  - name: MINIO_DOMAIN
    value: "minio.example.org"
  - name: MINIO_BROWSER_REDIRECT_URL
    value: "https://console.minio.example.org"
  - name: MINIO_SERVER_URL
    value: "https://minio.example.org"
  log:
    disabled: true
  prometheus:
    disabled: true
  prometheusOperator: true

pschichtel · 2024-06-12T17:41:49Z

also there no NetworkPolicy and also no outbound firewall in general, so there is no reason why the operator shouldn't be able to download and distribute the minio binaries.

pschichtel added community triage labels Apr 15, 2024

harshavardhana assigned donatello and poornas Apr 15, 2024

harshavardhana added priority: medium needs-investigation and removed triage labels Apr 15, 2024

harshavardhana added the fixed in latest release this issue is already fixed and upgrade is recommended label May 4, 2024

harshavardhana reopened this May 10, 2024

harshavardhana added waiting for info and removed fixed in latest release this issue is already fixed and upgrade is recommended labels May 10, 2024

Random invalid session and inconsistent service accounts #19510

Random invalid session and inconsistent service accounts #19510

Comments

pschichtel commented Apr 15, 2024 • edited

Expected Behavior

Current Behavior

Steps to Reproduce (for bugs)

Context

Your Environment

jiuker commented Apr 16, 2024

pschichtel commented Apr 16, 2024

jiuker commented Apr 16, 2024

pschichtel commented Apr 16, 2024

jiuker commented Apr 16, 2024

pschichtel commented Apr 16, 2024

jiuker commented Apr 16, 2024

pschichtel commented Apr 16, 2024

jiuker commented Apr 16, 2024 • edited

pschichtel commented Apr 16, 2024 • edited

jiuker commented Apr 16, 2024

pschichtel commented Apr 16, 2024

harshavardhana commented Apr 26, 2024

pschichtel commented Apr 26, 2024 • edited

poornas commented Apr 26, 2024

pschichtel commented Apr 26, 2024

pschichtel commented Apr 26, 2024

pschichtel commented Apr 29, 2024

jiuker commented Apr 29, 2024 • edited

pschichtel commented Apr 29, 2024

jiuker commented Apr 29, 2024

pschichtel commented Apr 29, 2024

pschichtel commented Apr 29, 2024

pschichtel commented May 6, 2024

pschichtel commented May 6, 2024

pschichtel commented May 10, 2024 • edited

harshavardhana commented May 10, 2024

harshavardhana commented May 10, 2024

pschichtel commented May 10, 2024

pschichtel commented May 10, 2024

harshavardhana commented May 10, 2024

pschichtel commented May 10, 2024

pschichtel commented May 10, 2024

pschichtel commented May 29, 2024 • edited

pschichtel commented Jun 6, 2024

harshavardhana commented Jun 10, 2024

pschichtel commented Jun 11, 2024 • edited

pschichtel commented Jun 11, 2024

klauspost commented Jun 11, 2024

pschichtel commented Jun 11, 2024

pschichtel commented Jun 11, 2024 • edited

harshavardhana commented Jun 11, 2024

harshavardhana commented Jun 11, 2024

pschichtel commented Jun 11, 2024

pschichtel commented Jun 12, 2024

harshavardhana commented Jun 12, 2024

harshavardhana commented Jun 12, 2024

pschichtel commented Jun 12, 2024

pschichtel commented Jun 12, 2024

pschichtel commented Apr 15, 2024 •

edited

jiuker commented Apr 16, 2024 •

edited

pschichtel commented Apr 16, 2024 •

edited

pschichtel commented Apr 26, 2024 •

edited

jiuker commented Apr 29, 2024 •

edited

pschichtel commented May 10, 2024 •

edited

pschichtel commented May 29, 2024 •

edited

pschichtel commented Jun 11, 2024 •

edited

pschichtel commented Jun 11, 2024 •

edited