1. The four data categories
Every piece of data that feeds the matching algorithm comes from one of four categories. Each category has different characteristics - how the data is collected, how rich it is, and what it can tell us about compatibility. Understanding these categories is essential to understanding why two users get the score they do.
1.1 External integrations
Behavioural data pulled from platforms the user already uses - Spotify, Steam, GitHub, Untappd, Letterboxd, Strava, Goodreads, and more. This is the richest data category because it reflects what people actually do, not what they say they do.
How it works: The user connects a platform via OAuth. An integration adapter pulls permitted data (listening history, game library, repos, check-ins, ratings) and maps it into the normalised data layer. The raw API data is stored privately; only derived signals are used for matching and shown to other users.
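A minimal sketch of what an adapter contract could look like - the interface, field names, and method signatures here are illustrative assumptions, not the platform's actual code:

```typescript
// Hypothetical adapter contract - names and fields are illustrative.
interface DerivedSignal {
  canonicalId: string;        // resolved entity ID (see section 5.1)
  category: string;           // e.g. "music", "gaming", "code"
  presence: boolean;          // the user has this item at all
  engagement: number | null;  // e.g. minutes listened, hours played
  sentiment: number | null;   // e.g. star rating, saved/liked status
}

interface IntegrationAdapter {
  platform: string;
  // Exchange the OAuth grant for access tokens scoped to permitted data.
  authenticate(oauthCode: string): Promise<void>;
  // Pull raw platform data and map it into derived signals. Raw API
  // responses stay in private storage; only derived signals flow onward.
  fetchSignals(userId: string): Promise<DerivedSignal[]>;
}
```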
Characteristics:
- Richest signal quality. External data can provide all three Affinity dimensions - presence, engagement, and sentiment - making it Tier 1 or Tier 2 in the data tier model.
- Passive collection. No effort required from the user beyond connecting. Their existing behaviour is the data.
- Platform-dependent. Different APIs expose different fields. Spotify gives popularity scores directly; Steam does not. Letterboxd has star ratings; GitHub does not. The normalised data layer must handle these asymmetries.
- Multiple providers per category. Music data might come from Spotify, Apple Music, Last.fm, SoundCloud, or Bandcamp. The matching engine operates at the category level, not the platform level - which creates the standardisation challenges documented below.
Example categories and integrations:
| Category | Integrations | Data signals |
|---|---|---|
| ♫ Music | Spotify, Apple Music, Last.fm, SoundCloud, Bandcamp | Artists, genres, listening time, audio features, saved status |
| 🎮 Gaming | Steam, PlayStation, Xbox, Nintendo | Games, playtime, genres, achievements, co-op preference |
| ⌨ Code | GitHub, GitLab, Bitbucket | Languages, topics, repos, commit cadence, stars |
| 🍺 Beer & Drinks | Untappd, Vivino, Distiller | Styles, ratings, breweries, check-in frequency |
| 🎬 Film & TV | Letterboxd, Trakt, IMDb | Films, directors, genres, star ratings, lists |
| 🏃 Fitness | Strava, Garmin, Apple Health | Activities, distances, consistency, training style |
| 📚 Books | Goodreads, StoryGraph, Bookwyrm | Authors, genres, reading pace, star ratings |
| 🎫 Events | Meetup, Eventbrite, Ticketmaster, Dice | Event types, venues, attendance frequency |
1.2 Assessments
Validated psychological instruments administered within the app - personality type indicators (Myers-Briggs, Big Five, 16PF), attachment style classification, and communication style profiling.
How it works: Users complete a structured questionnaire during or after onboarding. Responses are scored against established frameworks to produce a type or profile. The assessment engine handles scoring internally - the user sees their result, and the matching engine receives a structured profile it can compare.
Characteristics:
- Standardised format. Unlike external data (which varies wildly by platform), assessments produce a consistent, comparable output for every user.
- Research-backed scoring. Personality compatibility uses published research on type pairings (e.g. cognitive function alignment for MBTI, trait distance for Big Five) rather than naive letter-matching.
- Inverse compatibility. Some personality dimensions are more compatible when opposite - an introvert and an extrovert may complement each other. Assessments are the primary source of inverse compatibility signals (Markey & Markey, 2007).
- Requires user effort. Unlike passive external data, assessments require the user to answer questions. The UX must justify the effort by clearly showing how results improve match quality.
1.3 Structured Q&A
A values-and-preferences question system covering lifestyle, relationship goals, and personality - topics that external integrations cannot capture. Each question collects three dimensions of data.
How it works: Each question presents multiple-choice options. The user provides:
- Their answer - what they would personally choose.
- Acceptable answers - which answer(s) they would accept from a match. This enables inverse compatibility: you might answer "introvert" but accept "extrovert" in a partner.
- Importance - how much this question matters to them, from "Not important" to "Dealbreaker".
Characteristics:
- Bidirectional acceptability. Matching checks whether both users find each other's answer acceptable. A match requires mutual acceptance.
- Importance weighting. Dealbreaker conflicts halve the Q&A sub-score. Low-importance questions contribute minimally even when they mismatch (see the sketch after this list).
- Not interest-specific. Structured Q&A covers cross-cutting topics ("Do you want children?", "How do you handle conflict?", "How important is alone time?") - not niche hobby data. Interest-specific depth is handled by the fourth category.
- Consistent schema. Every question produces the same three-dimensional output, making it straightforward to score without the normalisation challenges of external data.
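The scoring rules above can be sketched in code. The type names, importance weights, and exact aggregation below are assumptions; only the mutual-acceptance check and the dealbreaker halving come from this page:

```typescript
type Importance = "not_important" | "somewhat" | "very" | "dealbreaker";

interface QAResponse {
  answer: string;
  acceptable: string[]; // answers the user would accept from a match
  importance: Importance;
}

// Hypothetical importance weights.
const WEIGHT: Record<Importance, number> = {
  not_important: 0.1, somewhat: 0.5, very: 1.0, dealbreaker: 1.0,
};

// Assumes a and b are answers to the same questions, index-aligned.
function qaSubScore(a: QAResponse[], b: QAResponse[]): number {
  let weighted = 0, total = 0, dealbreakerConflict = false;
  for (let i = 0; i < a.length; i++) {
    // Bidirectional acceptability: both users must accept each other's answer.
    const mutual = a[i].acceptable.includes(b[i].answer) &&
                   b[i].acceptable.includes(a[i].answer);
    const w = Math.max(WEIGHT[a[i].importance], WEIGHT[b[i].importance]);
    weighted += (mutual ? 1 : 0) * w;
    total += w;
    if (!mutual && (a[i].importance === "dealbreaker" ||
                    b[i].importance === "dealbreaker")) {
      dealbreakerConflict = true;
    }
  }
  const base = total > 0 ? weighted / total : 0;
  return dealbreakerConflict ? base / 2 : base; // dealbreaker conflicts halve
}
```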
1.4 Interest Q&A
A structured data collection system for interest categories that lack strong external integrations - pets, plants, board games, cosplay, astrology, vintage/thrift, and many more. This is the category that turns Affinity Atlas's questionnaire system from a chore into a genuine matching advantage.
How it works: Questions are organised in branching hierarchies that adapt based on previous answers. The system starts broad ("Are you a pet person?") and drills down based on the user's responses ("What kind of pets?", then breed, care routine, and dealbreaker questions).
Why this is a key differentiator:
- Produces Tier 1-equivalent data without an API. Frequency and quantity answers ("How often do you walk them?" → twice daily) become EngagementFactor inputs. Preference strength ratings ("How important are pets to you?") become SentimentFactor inputs. Answer rarity across all users ("I have a Maine Coon" is rarer than "I have a cat") drives NicheWeight. The result is full Affinity scoring - presence, engagement, and sentiment - through structured questions rather than API connections (see the sketch after this list).
- Adaptive depth. Users who say "No" to pets skip the entire pet sub-tree. Users who say "Yes" get progressively deeper questions. This means engaged users provide rich data while uninterested users face zero friction.
- Niche weighting from answer distributions. The platform tracks how common each answer is across all users. "I breed Savannah cats" is vastly rarer than "I like cats" - the same NicheWeight logic from external integrations applies, but computed from platform-level answer frequencies.
- Covers the long tail. There will never be a "Pet API" or a "Board Game Ownership API" with engagement data. Interest Q&A fills the gap for dozens of interest categories that would otherwise be limited to Tier 3 (presence-only) scoring.
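A sketch of the answer-to-factor mapping described in the first bullet above. EngagementFactor, SentimentFactor, and NicheWeight are this document's terms; the 0-1 scale values and option keys are hypothetical:

```typescript
// Hypothetical mapping from Interest Q&A answers to Affinity factors.
const FREQUENCY_SCALE: Record<string, number> = {
  rarely: 0.2, weekly: 0.5, daily: 0.8, twice_daily: 1.0,
};
const PREFERENCE_SCALE: Record<string, number> = {
  not_important: 0.2, somewhat: 0.5, very: 0.8, essential: 1.0,
};

// Frequency/quantity answers become EngagementFactor inputs.
function engagementFactor(answer: string): number {
  return FREQUENCY_SCALE[answer] ?? 0.5; // neutral default when unmapped
}

// Preference strength ratings become SentimentFactor inputs.
function sentimentFactor(answer: string): number {
  return PREFERENCE_SCALE[answer] ?? 0.5;
}

// NicheWeight from platform-level answer frequency: rarer answers weigh more.
function nicheWeightFromAnswer(usersWithAnswer: number, totalUsers: number): number {
  return 1 - usersWithAnswer / totalUsers;
}
```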
How the four categories relate to the algorithm: All four categories feed into the same core formula. External integrations typically produce Tier 1/2 data. Assessments produce structured type profiles with dedicated comparison functions. Structured Q&A produces importance-weighted acceptability scores. Interest Q&A produces Tier 1-equivalent Affinity scores. The algorithm does not care where data comes from - it scores Affinity, NicheWeight, SignalWeight, and CategoryMultiplier identically regardless of source.
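In code, this source-agnosticism could reduce to a single signal type. The ScoredSignal shape and the summation below are assumptions consistent with the components named above - the real core formula lives in the algorithm deep-dive:

```typescript
// Assumed shape: every data category reduces to the same signal type,
// so the scorer never branches on data source.
interface ScoredSignal {
  affinity: number;           // presence, engagement, sentiment combined
  nicheWeight: number;        // rarity of the shared item
  signalWeight: number;       // reliability of the signal
  categoryMultiplier: number; // per-category weighting
}

function scorePair(shared: ScoredSignal[], normaliser: number): number {
  // Each shared item contributes one term to the numerator.
  const numerator = shared.reduce(
    (sum, s) =>
      sum + s.affinity * s.nicheWeight * s.signalWeight * s.categoryMultiplier,
    0,
  );
  return numerator / normaliser;
}
```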
2. Hierarchical data modelling
The matching engine does not treat data as flat lists. Within every data category, information exists at interconnected levels - and the engine is aware of the hierarchy. This is a design requirement, not an optimisation.
2.1 External integration hierarchies
Every external data category has an inherent hierarchy. The matching engine operates at every level, each with its own Affinity and NicheWeight calculations.
Each category defines its own hierarchy - Music (genre → artist → track), Gaming, Code, Film & TV, Books, and Beer & Drinks all follow the same pattern: broad levels (genres, styles, languages) narrowing down to specific items (tracks, games, repos, beers).
2.2 Interest Q&A hierarchies
Interest Q&A hierarchies work differently from external integrations - they are defined by question branching rather than API object relationships. But the principle is identical: data exists at multiple levels, and matching happens at every level.
2.3 Aggregation: filling gaps in the hierarchy
Not every platform or API exposes every level of the hierarchy equally. The data architecture handles this through upward and downward aggregation.
Upward aggregation: If we have rich song-level data (listening history, saved status, minutes played) but no explicit genre affinity, we compute genre-level values by aggregating upward. A user's top genres are derived from the genres of their most-listened artists, weighted by engagement. If their top 5 artists are all in Progressive Metal, their genre affinity for Progressive Metal is high - even though no API returned "this user likes Progressive Metal".
Downward inference: If an API exposes only artist-level data (no individual track data), we can still infer genre affinity from the artist catalogue. An artist tagged with "shoegaze" and "dream pop" contributes to both genre nodes. The more artists a user has in a genre, the stronger the signal.
This aggregation pipeline ensures the matching engine functions at every level of the hierarchy, even when the source data only exists at one level. It is not a fallback - it is the normal operating mode for most data sources.
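A sketch of upward aggregation for the music example, assuming artist-level engagement as the input; the types and the share-of-total normalisation are illustrative:

```typescript
// Derive genre-level affinity from artist-level engagement.
interface ArtistSignal {
  artistId: string;
  genres: string[];   // genre tags on the artist
  engagement: number; // e.g. minutes listened
}

function genreAffinity(artists: ArtistSignal[]): Map<string, number> {
  const byGenre = new Map<string, number>();
  let total = 0;
  for (const a of artists) {
    // An artist tagged with several genres contributes to each genre node.
    for (const g of a.genres) {
      byGenre.set(g, (byGenre.get(g) ?? 0) + a.engagement);
    }
    total += a.engagement;
  }
  if (total === 0) return byGenre;
  // Normalise: genre affinity is that genre's share of total engagement.
  for (const [g, v] of byGenre) byGenre.set(g, v / total);
  return byGenre;
}
```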
3. Cross-domain interconnectedness
Data categories are not isolated silos. Entities, concepts, and even specific items appear across multiple categories simultaneously. The normalised data layer is designed to recognise and exploit these connections.
Shared entities across categories
Genres span categories
"Horror" is a genre in Film & TV (Letterboxd), Books (Goodreads), Gaming (Steam), and Anime (MyAnimeList). A user who watches horror films, reads horror novels, and plays horror games has a cross-category signal that is stronger than any single-source match. The normalised data layer maps genre tags to canonical genre entities that exist independently of their source category.
Artists span categories
An artist can appear in Music data (Spotify), Film data (as a soundtrack composer or actor), Event data (concert attendance via Ticketmaster), and even Gaming data (in-game soundtracks or music rhythm games). Trent Reznor appears in a user's Spotify data as a musician, in their Letterboxd data as a film composer ("Soul", "The Social Network"), and potentially in their event history as a concert attendee. These are not three separate signals - they are three facets of the same underlying taste.
People cross boundaries
An "author" in Books (Goodreads) might also be a "screenwriter" in Film (Letterboxd) - Neil Gaiman appears in both. A "director" might also appear in a user's podcast subscriptions (a director's commentary podcast). The data model tracks canonical person entities that can appear in multiple roles across categories.
Events are inherently cross-category
Event platforms (Meetup, Eventbrite, Ticketmaster, Dice) represent the strongest cross-category signal. A single event calendar can reveal:
- Music taste (concert attendance)
- Tech interests (meetup attendance)
- Fitness habits (parkrun, climbing events)
- Social preferences (pub quizzes, board game nights)
- Food & drink taste (food festivals, wine tastings)
The normalised data layer maps event tags into the appropriate existing categories as well as providing event-specific signals like attendance frequency and venue diversity.
Why interconnectedness matters for matching
When two users share a taste that appears across multiple categories, the signal is reinforced. If both users love horror films and horror novels and horror games, the algorithm scores each overlap independently - but the cumulative effect is a strong multi-dimensional match that a single-category system would miss.
This is not a bonus or a multiplier - it emerges naturally from the formula. More shared data points across more categories means more terms in the numerator of the core formula. Cross-category taste naturally produces higher scores because it represents broader, deeper compatibility.
Entity deduplication: The normalised data layer must ensure that the same underlying entity (e.g. "Horror" as a genre, or "Trent Reznor" as a person) is not double-counted within a single category. If a user's Spotify data and Apple Music data both reference the same artist, that artist should appear once in the Music category with the combined engagement data - not twice. Cross-category appearances (Trent Reznor in Music vs Film) are not deduplicated because they represent genuinely different signals.
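The dedup rule can be sketched as keying on entity and category together; the Appearance shape is a hypothetical simplification:

```typescript
// Merge within a category, keep across categories.
interface Appearance {
  canonicalId: string; // e.g. the resolved "Trent Reznor" entity
  category: string;    // "music", "film", ...
  engagement: number;
}

function dedupe(appearances: Appearance[]): Appearance[] {
  const merged = new Map<string, Appearance>();
  for (const a of appearances) {
    // Key on entity AND category: Spotify + Apple Music data for the same
    // artist collapse into one Music entry with combined engagement, while
    // Music vs Film appearances stay separate signals.
    const key = `${a.canonicalId}:${a.category}`;
    const existing = merged.get(key);
    if (existing) existing.engagement += a.engagement;
    else merged.set(key, { ...a });
  }
  return [...merged.values()];
}
```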
4. Data tiers and sparse information
The most important architectural decision in the data layer is how to handle varying levels of data richness. Not every source provides the same depth. Some give engagement metrics and ratings; others give only a list of items. The algorithm must work with whatever is available - and be honest about the confidence of the result.
- Tier 1 - Rich (Presence + Engagement + Sentiment)
- All three Affinity dimensions are active. Sources: Spotify (minutes listened + saved/liked status), Untappd (check-ins per style + average rating), Letterboxd (watched count + star rating), Goodreads (books read per author + star rating). Interest Q&A also produces Tier 1-equivalent data through frequency/quantity answers (engagement) and preference strength ratings (sentiment). Highest confidence.
- Tier 2 - Moderate (Presence + either Engagement or Sentiment)
- Two of three dimensions active; the missing one defaults to 1.0 (see the sketch after this list). Sources: Steam (hours played as engagement, implicit sentiment of 0.3 when playtime is under 2 hours), GitHub (repos per language as engagement, no rating). Good confidence.
- Tier 3 - Basic (Presence only)
- Affinity = 1 if both users share an item, 0 otherwise. No engagement depth, no sentiment. Sources: simple self-reported interest lists, platform follows without depth data. NicheWeight is estimated from Affinity Atlas platform-level frequency (how many users share this item). Low confidence, but still contributes.
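A sketch of tier classification and the missing-dimension default; field names are illustrative:

```typescript
interface EntitySignal {
  present: boolean;
  engagement: number | null; // null when the source exposes no engagement
  sentiment: number | null;  // null when the source exposes no sentiment
}

function tier(s: EntitySignal): 1 | 2 | 3 {
  if (s.engagement !== null && s.sentiment !== null) return 1;
  if (s.engagement !== null || s.sentiment !== null) return 2;
  return 3;
}

// Missing dimensions default to 1.0, so Tier 3 reduces to presence (1 or 0).
function affinity(s: EntitySignal): number {
  if (!s.present) return 0;
  return (s.engagement ?? 1.0) * (s.sentiment ?? 1.0);
}

// For a cross-source pair, the effective tier is the lowest common tier
// (numerically larger = weaker), as described under the sparse data
// problem below.
const effectiveTier = (a: EntitySignal, b: EntitySignal): number =>
  Math.max(tier(a), tier(b));
```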
The sparse data problem
Sparse data creates cascading challenges throughout the system:
1. Within a single source
A user connects Spotify but has only listened to 8 artists in the last month. The matching engine has very few data points to work with. NicheWeight might be meaningful for those 8 artists, but the coverage is thin. The algorithm handles this naturally - fewer data points means fewer terms in the numerator, which means the score is based on less evidence. The normalisation approach ensures this does not penalise the user.
2. Across sources within a category
One user has Spotify (Tier 1), the other has Apple Music (Tier 2, if the API provides less data). The matching must happen at the category level (Music) despite the sources providing different tiers of data. The normalised data layer maps both into a common schema, but the Affinity calculation must use the lowest common tier for shared items. If Spotify provides engagement and Apple Music does not, the engagement data from Spotify is retained for within-platform comparisons but cannot be used for the cross-platform match on that specific item.
3. Across categories
This is the shared-source normalisation problem. If User A connects only Spotify and User B connects Spotify + Steam + GitHub + Untappd + Goodreads, the score is calculated only on Spotify. User B's extra data does not penalise User A. This is a deliberate design choice documented in detail in the algorithm deep-dive.
4. Within Interest Q&A
A user answers "Yes" to "Are you a pet person?" and "Dogs" to "What kind of pets?" - but then drops off before answering breed, care routine, or dealbreaker questions. The system has partial hierarchy data. The matching engine handles this by scoring at the deepest completed level. Two users who both say "Dogs" get a presence match at that level. If one also specifies "Golden Retriever" and the other does not, the breed-level comparison is skipped (not penalised) for that pair.
The honest approach: Rather than inventing data to fill gaps (see the Bayesian prior method we rejected), the system is transparent about what it does and does not know. The category breakdown on every match card shows which categories contributed to the score and which were "not connected" or "not enough data". Explore prompts encourage users to add more data - but the algorithm never penalises them for what they have not provided.
5. The standardisation problem
This is the hardest engineering challenge in the entire system. From the user's perspective, "I listen to Igorrr on Spotify" and "I listen to Igorrr on Apple Music" are the same thing. From a data perspective, they are completely different: different API structures, different IDs, different metadata schemas, and - critically - no shared identifier.
The matching engine operates at the category level (Music, not Spotify). If one user connects Spotify and another connects Apple Music, they need to be matched on shared artists, genres, and listening behaviour. This means the normalised data layer must resolve "Spotify artist ID abc123" and "Apple Music artist ID xyz789" to the same canonical entity. This is the cross-platform identity resolution problem.
5.1 Cross-platform identity resolution
Every platform uses its own identifier system. No two platforms share IDs.
| Platform | Artist identifier | Example for "Igorrr" |
|---|---|---|
| Spotify | spotify:artist:{22-char-base62} | spotify:artist:03kp4aSGGfiT0VgN... |
| Apple Music | Numeric catalog ID | 287813189 |
| Last.fm | MusicBrainz MBID (when available) | mbid:a1b2c3d4-... or name string |
| MusicBrainz | UUID (MBID) | a1b2c3d4-e5f6-... |
| Discogs | Numeric artist ID | 2567891 |
The normalised data layer resolves these through a multi-signal matching pipeline:
- Exact name + metadata match. If two sources return an entity with the same name, same genre tags, and similar metadata (release years, album count), they are likely the same entity. Confidence: medium-high.
- External catalogue lookup. Services like MusicBrainz, Wikidata, and ISRC (International Standard Recording Code) provide cross-platform identifiers. When available, these are the gold standard for resolution. Confidence: high.
- Fuzzy matching with disambiguation signals. When names are similar but not identical ("Bjork" vs "Björk" vs "Björk Guðmundsdóttir"), the pipeline uses album discographies, genre overlap, and active years to disambiguate. Confidence: medium.
- Manual override / community correction. For edge cases, the platform maintains a correction table that maps known problematic entities. This is a last resort.
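A sketch of the pipeline above as code. The helper functions are hypothetical stubs and the confidence values are illustrative; they feed the confidence bands defined in section 5.3:

```typescript
interface PlatformEntity {
  platform: string;
  platformId: string;
  name: string;
  genres: string[];
  mbid?: string; // MusicBrainz ID, when the platform exposes one
}

interface Resolution {
  canonicalId: string;
  confidence: number; // 0-1
}

// Hypothetical helpers standing in for the name-matching steps.
declare function exactNameMetadataMatch(e: PlatformEntity): string | null;
declare function fuzzyDisambiguate(e: PlatformEntity): string | null;

function resolve(e: PlatformEntity, catalogue: Map<string, string>): Resolution {
  // External catalogue lookup (MBID, ISRC, ...) - the gold standard.
  if (e.mbid && catalogue.has(e.mbid)) {
    return { canonicalId: catalogue.get(e.mbid)!, confidence: 0.95 };
  }
  // Exact name + metadata match - medium-high confidence.
  const exact = exactNameMetadataMatch(e);
  if (exact) return { canonicalId: exact, confidence: 0.8 };
  // Fuzzy matching with disambiguation signals - medium confidence.
  const fuzzy = fuzzyDisambiguate(e);
  if (fuzzy) return { canonicalId: fuzzy, confidence: 0.6 };
  // No resolution (manual-override table omitted here): treat as a new,
  // separate canonical entity rather than risk a false merge.
  return { canonicalId: `new:${e.platform}:${e.platformId}`, confidence: 1.0 };
}
```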
Spotify: "Igorrr" (artist ID abc123)
Apple Music: "Igorrr" (catalog ID 287813189)
MusicBrainz MBID lookup: both map to the same MBID.
✔ Canonical entity: Igorrr (MBID: a1b2c3...)
The user who listens on Spotify and the user who listens on Apple Music are matched on the same artist.
Spotify: "Aurora" (Norwegian singer)
Apple Music: "Aurora" (Korean band)
Same name, different artists. No shared MBID.
✘ Must use genre + discography to disambiguate
Without disambiguation, the system would falsely match two users who listen to completely different artists.
The same challenge applies across every data category:
- Gaming: "Hades" on Steam (app ID 1145360) vs "Hades" on PlayStation (CUSA-18498) vs "Hades" on Nintendo eShop (different ID). Same game, three different identifiers. Additionally, "Hades" could refer to the 2020 Supergiant game or the 2018 Early Access version - or a completely unrelated older game.
- Books: "1984" by George Orwell has different Goodreads IDs, ISBNs (hardcover vs paperback vs Kindle), and StoryGraph entries. The canonical entity is the work, not the edition - but most APIs return edition-level data.
- Film: Letterboxd and IMDb use different ID systems. A film might have multiple entries across platforms (original release vs director's cut vs remaster). Some Letterboxd films are tagged with TMDb IDs; others are not.
- Beer: Untappd uses its own beer/brewery IDs. Vivino uses wine-specific identifiers. A crossover product (a craft beer from a winery) might appear in both systems with no linking identifier.
5.2 Schema normalisation
Beyond identity, each platform returns data in a completely different structure. The normalised data layer must map platform-specific schemas into a common format per category.
| Data point | Spotify | Apple Music | Last.fm |
|---|---|---|---|
| Popularity | popularity: 0-100 (direct) | Not exposed in API | listeners: count (raw number) |
| Play count | Not in standard API (limited to top tracks) | playCount per song | playcount per artist/track |
| Genres | On Artist object (array of strings) | On Song/Album (single genre field) | Tags (user-generated, noisy) |
| Audio features | danceability, energy, valence (0-1) | Not available | Not available |
| Saved/liked | Boolean per track/album/artist | inLibrary boolean | loved boolean |
The normalised music schema must handle: popularity available from one source but not others, play count format differences, genre granularity differences (Spotify's genre system is more granular than Apple Music's), and features that only one platform provides. Missing fields default to null and are handled by the tier system.
The normalisation process per category follows a consistent pattern:
- Extract: Pull raw data from the platform API via the integration adapter.
- Map: Transform platform-specific fields into the common category schema. Known fields map directly; unknown fields are dropped or flagged for future support.
- Resolve: Run cross-platform identity resolution on entities (artists, games, books, etc.) to map to canonical IDs.
- Merge: If a user has multiple sources in the same category (e.g. both Spotify and Last.fm), merge their data. Prefer the richer source for engagement metrics; combine for breadth.
- Classify: Assign a data tier to each entity based on what dimensions are available (presence, engagement, sentiment).
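The five steps might compose as sketched below; every helper name here is a hypothetical stand-in for the adapter and resolution machinery described elsewhere on this page:

```typescript
interface CategoryEntity {
  canonicalId: string;
  engagement: number | null;
  sentiment: number | null;
  tier?: 1 | 2 | 3;
}

// Hypothetical stubs for the pipeline stages.
declare function extract(platform: string, userId: string): Promise<unknown[]>;
declare function mapToSchema(raw: unknown[]): CategoryEntity[];
declare function resolveIds(entities: CategoryEntity[]): CategoryEntity[];
declare function classifyTier(e: CategoryEntity): 1 | 2 | 3;

async function normaliseCategory(
  userId: string,
  platforms: string[], // all connected sources in this category
): Promise<CategoryEntity[]> {
  const merged = new Map<string, CategoryEntity>();
  for (const platform of platforms) {
    const raw = await extract(platform, userId);  // 1. Extract
    const mapped = mapToSchema(raw);              // 2. Map
    for (const e of resolveIds(mapped)) {         // 3. Resolve
      const prior = merged.get(e.canonicalId);    // 4. Merge
      if (!prior) {
        merged.set(e.canonicalId, e);
      } else {
        // Prefer the richer source per dimension; combine for breadth.
        prior.engagement = prior.engagement ?? e.engagement;
        prior.sentiment = prior.sentiment ?? e.sentiment;
      }
    }
  }
  for (const e of merged.values()) e.tier = classifyTier(e); // 5. Classify
  return [...merged.values()];
}
```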
5.3 Name disambiguation
The most insidious standardisation problem is not different IDs - it is identical names for different entities. This is far more common than most people assume.
The scale of the problem
- Music: There are at least 7 different artists named "Aurora" on Spotify alone (Norwegian singer, Korean band, Finnish metal band, and more). MusicBrainz lists 30+ artists named "Aurora" or variations. The problem scales with common names: there are dozens of listings for names like "The National", "Daughters", and "Beach House" (most resolve to the same well-known act, but some are genuinely different artists).
- Books: Searching "James Patterson" on Goodreads returns the thriller author - but also James Patterson the children's author (same person, different catalogue segmentation) and unrelated authors with similar names.
- Gaming: "Doom" could refer to the 1993 original, the 2016 reboot, or Doom Eternal. "Final Fantasy" followed by a number could be the original or a remaster. Some platforms list both; others only the latest version.
- Film: Multiple films share names across decades. "Dune" (1984) and "Dune" (2021) are different films. "The Batman" (2022) vs "Batman" (1989) vs the animated series.
Disambiguation strategies
- Multi-field matching. Never match on name alone. Always use name + at least one secondary field (genre, year, discography, publisher, director). Two entities named "Aurora" with overlapping genre tags and albums are likely the same; two with completely different genres are not.
- Canonical ID priority. When a canonical identifier exists (MusicBrainz MBID, ISRC, ISBN, TMDb ID, IGDB ID), prefer it over name matching. This handles the vast majority of cases for popular entities.
- Confidence scoring. Each resolution gets a confidence score (0-1). High-confidence matches (>0.9) are auto-resolved. Medium-confidence matches (0.5-0.9) are resolved but flagged for potential review. Low-confidence matches (<0.5) are treated as separate entities until more data is available.
- User-facing transparency. When the system is uncertain about a resolution, it errs on the side of not matching rather than creating a false positive. It is better to miss a cross-platform match than to incorrectly merge two different entities.
False merge is worse than missed match. If the system incorrectly decides that two different "Aurora" artists are the same entity, it corrupts the matching engine - a user who likes Norwegian art-pop would be matched with a user who likes Korean pop. This is a category error, not a minor inaccuracy. The disambiguation pipeline is therefore conservative: uncertain matches are kept separate.
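The conservative policy reduces to a small decision function; the thresholds are the confidence bands above, the decision type is illustrative:

```typescript
type MergeDecision = "merge" | "merge_flagged" | "keep_separate";

function decide(confidence: number): MergeDecision {
  if (confidence > 0.9) return "merge";          // auto-resolved
  if (confidence >= 0.5) return "merge_flagged"; // resolved, queued for review
  return "keep_separate"; // a false merge is worse than a missed match
}
```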
5.4 Niche weight reconciliation
Even when entities are correctly resolved, their popularity scores differ across platforms. An artist might have a Spotify popularity of 42 but a Last.fm listener count that maps to a very different percentile. Which NicheWeight should the matching engine use?
- 1. Prefer platform-native popularity
- If matching two Spotify users, use Spotify's popularity score directly. It is the most contextually accurate. This is the simplest approach and works when both users are on the same platform.
- 2. Normalised cross-platform popularity
- When users are on different platforms, compute a blended popularity: normalise each platform's metric to a 0-100 scale using percentile ranking within that platform's population, then average: nicheWeight = 1 - avg(normPopA, normPopB) / 100. This prevents one platform's inflated or deflated popularity from dominating.
- 3. Affinity Atlas internal frequency
- For items where no platform provides a clean popularity score, fall back to internal frequency: how many Affinity Atlas users have this entity in their data? This is always available (by definition) and grows more accurate as the user base grows. It is the universal fallback for Tier 3 data.
The reconciliation follows a priority order: platform-native (when both users are on the same platform) > cross-platform normalised blend > internal frequency fallback. This ensures the most accurate NicheWeight is used in every scenario.
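A sketch of the priority order as a single function; the field names are illustrative:

```typescript
interface PopularityContext {
  samePlatform: boolean;
  nativePopularity?: number; // 0-100, when the shared platform exposes one
  normPopA?: number;         // percentile-normalised (0-100), user A's platform
  normPopB?: number;         // percentile-normalised (0-100), user B's platform
  internalFrequency: number; // share of Affinity Atlas users with this item (0-1)
}

function nicheWeight(ctx: PopularityContext): number {
  // 1. Platform-native popularity when both users share the platform.
  if (ctx.samePlatform && ctx.nativePopularity !== undefined) {
    return 1 - ctx.nativePopularity / 100;
  }
  // 2. Cross-platform blend: nicheWeight = 1 - avg(normPopA, normPopB) / 100.
  if (ctx.normPopA !== undefined && ctx.normPopB !== undefined) {
    return 1 - (ctx.normPopA + ctx.normPopB) / 2 / 100;
  }
  // 3. Universal fallback: internal platform-level frequency.
  return 1 - ctx.internalFrequency;
}
```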
6. The normalised data layer
The normalised data layer is the architectural core that sits between raw integration data and the matching engine. It is the component that makes "matching at the category level, not the platform level" possible.
- Integration adapters
- One adapter per platform (Spotify adapter, Steam adapter, GitHub adapter, etc.). Each adapter knows how to authenticate, fetch permitted data, and transform it into the common category schema. Adding a new platform to an existing category requires only a new adapter - the matching engine, UI, and scoring logic are unchanged.
- Entity resolution
- The cross-platform identity resolution pipeline described in section 5.1. Maps platform-specific IDs to canonical entities.
- Hierarchy construction
- Builds the hierarchical data model described in section 2. Fills in missing hierarchy levels through aggregation.
- Tier classification
- Assigns a data tier (1, 2, or 3) to each entity based on available dimensions. This tier determines which Affinity components are active.
- NicheWeight computation
- Computes popularity/rarity for each entity using the reconciliation strategy appropriate to the context.
- Cross-category linking
- Identifies entities that appear across categories (a genre in Film and Books, an artist in Music and Film) and tags them for cross-category reinforcement awareness.
- Privacy enforcement
- Stores raw API data encrypted and access-controlled. The matching engine only receives normalised, derived signals - never raw data. User-facing profiles only show what the user has explicitly approved for display.
The normalised data layer is what makes the promise of "matching on shared taste, not shared apps" technically achievable. Without it, a Spotify user and an Apple Music user could never be compared. With it, they are matched on the same canonical artists, genres, and listening patterns - regardless of which platform provided the data.
7. Design principles
The data architecture is guided by seven principles:
- Category over platform. The matching engine operates on interest categories (Music, Gaming, Film), not platforms (Spotify, Steam, Letterboxd). Platforms are data pipes; categories are signals. Two users on different platforms should be matchable on the same underlying taste.
- Hierarchy over flat lists. Data within each category exists at multiple interconnected levels. The matching engine operates at every level - shared genres, shared artists, shared tracks - each contributing its own Affinity and NicheWeight to the score.
- Graceful degradation. The system must produce meaningful results with whatever data is available, from Tier 1 (rich, multi-dimensional) to Tier 3 (presence-only). Missing data is handled honestly - no fabrication, no penalisation.
- Conservative resolution. When cross-platform identity is uncertain, prefer false negatives (missed matches) over false positives (incorrect merges). A missed overlap is a minor loss of signal; a false merge corrupts the matching engine.
- Privacy by architecture. Raw API data is stored encrypted and never exposed to other users or the matching engine. Only derived, normalised signals flow through the system. Users control what is displayed on their profile independently of what is used for scoring.
- Adapter isolation. Adding a new integration to an existing category requires only a new adapter. The matching engine, scoring logic, UI, and normalised schema are untouched. This is the architectural foundation for scaling to dozens of platforms.
- Transparency. Every step of the data pipeline - from raw API data through normalisation to Affinity scoring - is documentable and explainable. Users should be able to understand not just their compatibility score, but what data produced it and how that data was collected.
Related reading: The algorithm deep-dive documents how the normalised data is scored - every formula, every component, every alternative approach considered. This page documents how the data arrives at the algorithm's front door.