Keywords:
Context:
This project is situated in a near-future condition in which biometric authentication has become normalized infrastructure. Voice assistants, facial recognition, and fingerprint systems increasingly bind access, ownership, and decision-making authority to individual biological identity.
The work explores an alternative model:
authentication as collective (and even social) presence rather than individual verification.
Partly inspired by Project Cybersyn in Chile -> a distributed decision-support system built to aid the national economy (economic simulator, factory performance monitoring, operations room, telex network). The similarity to my idea ends at the distributed part.
But if we imagine this group biometric policy being implemented at a larger scale, it can mean:
This distributes the risk of biometric abuse:
Real-world implementations (so far), although none are audio-based:
Where are the audio-based examples? Where can a listening machine step in here?
To be honest…
Collective biometric unlocking is uncommon because:
But…
Collective voice authentication becomes relevant when authorization itself is intended to be social rather than individual. The question is less technical feasibility and more whether a system benefits from distributed consent.
Voice has social advantages; it can be:
So why a listening machine?
The machine contains the algorithm to detect the cipher, while the two individuals each carry one half of a two-part cipher. We need the listening machine to transcribe what the individuals are saying, check whether it matches the stored cipher, and then unlock the data.
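The check described above can be sketched minimally. The stored phrase, the function name, and the idea of matching each half against a transcript are assumptions for illustration; the project's actual cipher format is not specified here.

```python
# Hypothetical sketch: the listening machine compares two transcribed
# utterances against a stored two-part cipher, one half per person.
# STORED_CIPHER and verify_cipher are illustrative names, not the
# project's real implementation.

STORED_CIPHER = ("open the", "north archive")  # each person holds one half

def verify_cipher(transcript_a: str, transcript_b: str) -> bool:
    """Unlock only if each speaker's transcript matches their half."""
    said = (transcript_a.strip().lower(), transcript_b.strip().lower())
    return said == STORED_CIPHER
```

In practice the transcripts would come from a speech-to-text step, and fuzzy matching would likely be needed to tolerate transcription errors.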
Why do people need to speak here?
Speech becomes a performed consent for data access.
Content:
Concept:
The biometric idea was provoked by reading “The Native Ear” by Michelle Pfeiffer. Although the project won’t touch on the subject of immigrants, it questions the use of voice-based identification to acquire data access. As with the voice-as-passport, the voice biometric here becomes a key that unlocks access or privilege.
Issues
It seems very challenging to detect two speakers speaking synchronously in one recording. The proposed solution is to have them speak to their devices at the same time, with each person using an independent device. Record the timestamp of each recording; if the two timestamps fall within the same threshold, process the audio. If not, return early.
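The timestamp gate above can be sketched as a small check. The threshold value and function names are assumptions for illustration; the real tolerance would depend on device clock sync and network latency.

```python
# Hypothetical sketch of the synchrony gate: each device reports a
# recording-start timestamp, and audio is only processed when the two
# fall within a threshold. THRESHOLD_S is an assumed value, not tuned.

THRESHOLD_S = 2.0  # max allowed gap between the two recordings, in seconds

def within_sync_window(ts_a: float, ts_b: float,
                       threshold: float = THRESHOLD_S) -> bool:
    """Return True if the two recordings started close enough together."""
    return abs(ts_a - ts_b) <= threshold

def process_pair(ts_a, ts_b, audio_a, audio_b):
    if not within_sync_window(ts_a, ts_b):
        return None  # return early: not a synchronous attempt
    return (audio_a, audio_b)  # hand off to transcription / verification
```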
Question: what is the best way to detect that person A/B/other is speaking *and* saying a specific word?
Possibility of training strategy
A.
The difficulty is that “everyone else” is not a real class.
You cannot directly train a model on infinite unknown speakers.
So the correct formulation is verification + rejection, not ordinary 3-class classification.
The model may only work on the trained set of random people -> open-set recognition failure.
B.
Class 1: You
Class 2: Person B
Use embeddings
Audio → Speaker Encoder (SpeechBrain in Python) → Embedding (vector)
Then use the embedding-distance score to determine whether the voice closely resembles yours, Person B’s, or a stranger’s.
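Strategy B can be sketched as follows, assuming the embeddings already come from a speaker encoder (e.g. a SpeechBrain pretrained model). The threshold, function names, and the use of cosine similarity are illustrative assumptions; a real threshold must be tuned on held-out recordings.

```python
# Hedged sketch of verification + open-set rejection over speaker
# embeddings. Embeddings are assumed to be fixed-length vectors from a
# speaker encoder; names and the 0.6 threshold are illustrative only.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(embedding: np.ndarray, enrolled: dict,
             threshold: float = 0.6) -> str:
    """Compare an utterance embedding against each enrolled speaker.
    Returns the best-matching name, or 'unknown' (open-set rejection)
    when no score clears the threshold, so strangers are not forced
    into one of the enrolled classes."""
    best_name, best_score = "unknown", threshold
    for name, ref in enrolled.items():
        score = cosine_similarity(embedding, ref)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

In practice the enrolled vectors might be produced with something like SpeechBrain's pretrained ECAPA speaker encoder (loaded via `EncoderClassifier.from_hparams` and applied with `encode_batch`), averaging several enrollment utterances per person.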
Do I need Negative Samples?
Without negatives, will the model trigger constantly (accepting almost any voice)?
Method

UI Overview

Elizabeth Kezia Widjaja © 2026 🙂