House of Data - Episode 1: Theory

My blueprint to build a full-fledged system to monitor, activate and analyse intent signals at scale.

Detecting intent signals and sending notifications to SDRs in Slack is fine (even necessary at first). But what if we want to:

  • build a lead scoring based on the detected signals?

  • dispatch signals in the CRM in addition to Slack?

  • browse the history of detected signals?

What i’m about to share is scarce and poorly documented.

Tinkering is over.

Time to scale.

The goal → build a system that makes it easy to:

  • tap into signals (dispatch them in the CRM, build a scoring, …)

  • iterate (add new signals to the mix, change the scoring computation, …)

  • analyse detected signals (how many, at what time, of which types, …)

that’s what i’d initially written here, in the intro…

…but then, a few weeks later, someone pulled the computer keyboard away from me and forced me to contemplate the encyclopedia i’d just given birth to.

These were my two options:

  1. keep lying

  2. or meme my way out of it and subdivide the newsletter into two parts.

That’s already two memes in the intro alone, so, option 2 it is 🌝

I’ve subdivided the newsletter in the following two parts:

  1. the first one is theoretical:

    1. first, i discuss the two essential pillars of a successful intent signal detection system, and why it’s not just about automation

    2. then i reveal my blueprint to build such a system — the foundations first, and how to iterate on top of it afterwards

  2. the second part is practical: we’ll discuss 3+ real-life intent signals i’ve played with during a mission with Bulldozer (this one, you’ll get next week).

So, without further ado, i present the first episode in the series:

Two pillars

Automation and…

Instinctively, when i say “intent signals”, you reply “automation”, “Zapier” or “n8n”. We tend to think only about how data travels from its source to its destination, from LinkedIn to Slack for instance.

Don’t get me wrong: automation is vital to the kind of systems we’re building here. But there’s another pillar that’s just as important (if not more).

And, paradoxically, you must think first about the latter — before thinking about automating the hell out of everything: the storage pillar.

Why aren’t notifications enough?

Assume you want to ingest multiple criteria in your lead scoring:

  • static → the company’s industry, its location, etc.

  • dynamic → intent signals 🙂 (e.g. “company’s hiring a head of sales”, if you’re a CRM vendor).

Now, assume that the lead triggered not one, but MANY different intent signals, and that the only place where you kept track of them all is Slack, as messages.

something like that for instance

Good luck integrating the intent signals’ details into your scoring.

Good luck avoiding duplicate signals in the Slack channel.

Good luck building intent signal dashboards.

Where to store signals then?

Hopefully, i just convinced you that dropping everything in a Slack channel isn’t sustainable. I still have to convince you that the best place to store intent signals is a database.

Let’s rewind and go back to the lead scoring example.

Imagine if a single code snippet sufficed to retrieve all signals ever emitted by a company since a specific date — say, company with id=123456789, between 14/02/2025 and today.

SELECT
	signal_id,
	company_id,
	detection_date,
	'website visit' as signal_type -- we select the fields we're interested in
FROM website_visit_intent_signals -- in the table of website visit signals
WHERE (
	detection_date >= '2025-02-14' -- after february 14th
	AND company_id=123456789 -- emitted by company 123456789
)
UNION -- and to all of the above, we add
SELECT
	signal_id,
	company_id,
	detection_date,
	'job offer' as signal_type -- the same fields
FROM job_offer_intent_signals -- but from the table of job offers
WHERE (
	detection_date >= '2025-02-14' -- after Valentine's day again
	AND company_id=123456789 -- emitted by the same company
);

Imagine if this simple command yielded tables looking like this:

| signal_id   | company_id | detection_date | signal_type   |
| ----------- | ---------- | -------------- | ------------- |
| web-vis-245 | 123456789  | 2025-02-14     | website visit |
| web-vis-311 | 123456789  | 2025-02-16     | website visit |
| job-off-44  | 123456789  | 2025-02-16     | job offer     |
| web-vis-980 | 123456789  | 2025-02-21     | website visit |

I believe you already understand how easy it would be to compute company 123456789’s score; how easy it would be to build the dashboard of all the signals detected each month; how easy it would be to ensure we didn’t duplicate the website visit of 16/02/2025.
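
For instance, computing that score could be a single aggregation on top of a UNION like the one above. A minimal sketch (the weights per signal type are made-up assumptions, tune them to your own logic):

SELECT
	company_id,
	SUM(
		CASE signal_type
			WHEN 'job offer' THEN 3 -- hypothetical weight: hiring signals count more
			WHEN 'website visit' THEN 1 -- hypothetical weight: a visit counts less
			ELSE 0
		END
	) AS intent_score
FROM (
	SELECT company_id, detection_date, 'website visit' AS signal_type FROM website_visit_intent_signals
	UNION ALL
	SELECT company_id, detection_date, 'job offer' AS signal_type FROM job_offer_intent_signals
) AS all_signals
WHERE detection_date >= '2025-02-14' -- only keep recent signals
GROUP BY company_id;

And the monthly dashboard is the same idea: GROUP BY date_trunc('month', detection_date) and signal_type, then COUNT(*).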

Hence my point: for all this to be easy, you must store signals in a database.

I know… sounds a bit scary.

We all love google sheets.

But it’s not sustainable (google sheets does not prevent you from creating duplicates for instance, contrary to a well-configured database — if you needed just one reason).

So, the time has come to prove that you deserve the “engineer” in “growth engineer”.

What database?

We must store intent signals in a database. Ok.

But what kind of database?

Of course, there are multiple options, but keep this in mind:

  1. signals are either linked to a company or a contact (depending on who emitted the signal)

  • we sometimes want to store numbers, sometimes dates, sometimes text, …

Therefore, we will favour a:

  1. relational database (i.e. a database where we can indicate that a signal in a table is linked to a contact in another table, itself linked to a company in another — hence the “relation”)

  2. typed database (i.e. a database where we can store strings, booleans, dates, objects, …).

I strongly recommend PostgreSQL, but we’ll address another option in the second part of this newsletter.
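
To make this concrete, here’s a minimal PostgreSQL sketch (table and column names are placeholders, not a prescribed schema) showing both properties: typed columns, and a relation between two tables:

CREATE TABLE companies (
	linkedin_id    TEXT PRIMARY KEY, -- a unique identifier (more on this below)
	name           TEXT,
	industry       TEXT,
	employee_count INTEGER -- typed: a number
);

CREATE TABLE job_offer_intent_signals (
	signal_id           TEXT PRIMARY KEY,
	company_linkedin_id TEXT REFERENCES companies (linkedin_id), -- the "relation"
	job_title           TEXT,
	remote              BOOLEAN, -- typed: a boolean
	detection_date      DATE -- typed: a date
);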

Blueprint

Hitherto, i’ve explained why our system also needs a storage pillar in addition to its automation pillar. Time to introduce my blueprint and its four stages:

  1. target architecture → how to structure the database

  2. foundations → how to define and store the total addressable market

  3. identification:

    1. duplicates → how to define ids to enforce records’ uniqueness

    2. relations → how to define ids to tie tables together

  4. iterations → how to monitor and store new types of intent signals

0. Target architecture

The market (the companies and the contacts within these companies) is at the centre because, ultimately, we want to sell something to these companies.

More specifically, we want to know how and when to reach out to them. Thus they hold a central place in this endeavour — literally and metaphorically.

But, unsurprisingly, it’s not the only data we store in there. We also store a variety of intent signals that revolve around the addressable market.

And, if you pay enough attention, you’ll notice that a signal is either emitted by a company or a contact in that company (the green dotted lines).

Said differently: every signal is linked to a company or contact.

Let me repeat: every signal is linked to a company or contact.

1. Foundations

Time to discuss the content of that addressable market. We could define it like this, for instance:

  • “every company in Omaha, with between 50 and 200 employees, that works in finance”

  • “every company on which we once opened an opportunity”

  • “all our users, no matter their company”.

That being said, i strongly advise setting strict criteria to prevent any database overflow (you’re always free to relax some of these constraints later, if your addressable market is running dry).

Get this though: as soon as you detect an intent signal, if you decide to store it in the database, you should store it alongside the company or contact that triggered it.

The above sentence hints at something you must not miss: signals don’t always stem from companies or contacts you already have in your database (think website visits for instance → companies visiting your website are not all stored in your database).

However, if we want to store that signal, we must store the emitting company at the same time, because… every signal is linked to a company or contact in our database (we said it earlier).
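
In practice, on the placeholder schema sketched earlier, the workflow would insert the emitting company first, then the signal, in the same transaction (the values are made up for the example):

BEGIN;

-- create the emitting company if we don't already have it
INSERT INTO companies (linkedin_id, name, industry, employee_count)
VALUES ('lnkd-987654', 'Bricks & Co', 'construction', 120)
ON CONFLICT (linkedin_id) DO NOTHING;

-- then store the signal, linked to that company
INSERT INTO job_offer_intent_signals (signal_id, company_linkedin_id, job_title, remote, detection_date)
VALUES ('job-off-44', 'lnkd-987654', 'Head of Sales', false, '2025-02-16')
ON CONFLICT (signal_id) DO NOTHING;

COMMIT;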

This has a direct implication: the way you define and store signals will dictate how much your addressable market inflates:

the more signals, the more companies and contacts in your database.

That’s both a blessing and a curse, because the more records in your database:

  • the harder and costlier it is to keep your data fresh and clean

  • the less focused your sales reps.

In my opinion, it’s much more profitable to have a few hundred leads about which you know everything, rather than thousands you know little about, apart from the fact that they emitted a random signal 2y ago.

So, at project launch, my suggestion is to design a few SalesNavigator searches, extract their companies and contacts and store them in your database.

Which leads to the next section.

2. Identification

2. a) Duplicates

Duplicates… Hell on Earth. The demise of any database, CRM or google sheet.

Reading the word probably has you sweating already.

Me too.

And at the same time, it makes me angry.

Because it’s a problem that’s so easy to fix.

Like, really.

You ONLY have to define unique identifiers for everything you store in your database and to stick to them.

That’s it.

That being said, though simple, it’s not an innocuous choice: the way you choose and define identifiers can make or break your database’s hygiene.

For instance, you’ll avoid at all costs using company or contact names as identifiers because:

  • there’s a hundred different ways to write them (Coca, COCA-COLA, Coca Cola Ltd., …)

  • they’re not always unique (e.g. “John Doe” in the US, or “Jean Petit” in France)

  • they can change

    lol 🌝

Good identifiers could be websites for companies, and emails for contacts — even if both can change over time (company rebranding; career move).

Better identifiers are their LinkedIn ids. I’ll go as far as saying that they’re the best identifiers if most of your targets are on LinkedIn:

  • they’re unique

  • they don’t change with rebrandings or career moves

  • they’re rather easy to find and extract (given the right tools).
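
As a quick illustration on the placeholder companies table from earlier: with the LinkedIn id as primary key, the database itself refuses duplicates, whatever the spelling of the name:

-- the first insert goes through
INSERT INTO companies (linkedin_id, name)
VALUES ('lnkd-111222', 'Coca Cola Ltd.');

-- same LinkedIn id, different spelling: without ON CONFLICT this would raise an error;
-- with it, the row is simply skipped instead of creating a duplicate
INSERT INTO companies (linkedin_id, name)
VALUES ('lnkd-111222', 'COCA-COLA')
ON CONFLICT (linkedin_id) DO NOTHING;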

Finally, it won’t surprise you that we also need to think about identifiers when it comes to signals (since we want to store them in the database too).

And i’ll be honest, that’s one order of magnitude more difficult: not only do we need a unique identifier to deduplicate them, we also need a way to collect the identifier of the company or contact they stem from.

Let me give you an example:

We detect a website visit and want to link that signal to the emitting company (“Taco Love”). Unfortunately, two companies have the same name: one in California; the other in Texas.

This means that we can forget about using the company name both as an identifier to deduplicate our website visit intent signal, and as an identifier to link signals to companies in the database.

You get the gist.

(we’ll discuss solutions in the second part of the series)

2. b) Relationships

You got from the above that we need identifiers not only to avoid duplicates, but also to link records from different tables together.

Before we dive in, let me indulge in a few definitions. We call:

  • primary key the identifier that helps avoid duplicates — it’s the unique identifier of the record in ITS reference table (the table i position myself in — a bit like a frame of reference in physics).

  • foreign key the identifier of a record that is FOREIGN to the reference table.

    Example: let’s link the records of the website visits table to those of the contacts table using the linkedInUrl: it’s the identifier of the contacts table, but foreign to the website visits table.
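
In SQL, that example would look like this (a sketch, assuming a contacts table keyed by linkedin_url already exists):

CREATE TABLE website_visit_intent_signals (
	signal_id            TEXT PRIMARY KEY, -- primary key: unique in ITS reference table
	contact_linkedin_url TEXT REFERENCES contacts (linkedin_url), -- foreign key: identifies a record of the contacts table
	page_url             TEXT,
	detection_date       DATE
);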

Good.

Now, let’s talk about the links between our tables. We already said a gazillion times that every signal is either linked to a company or contact.

Ok.

But now you also understand from the above that, for each signal, we’ll have to define a foreign key referencing the companies or contacts table in order to link that signal to its emitting company or contact.

Ok.

But what about links between companies and contacts?

Well, it’s possible for a company to be linked to no contact at all (because we didn’t find them yet, because we didn’t want to search for them, because they’re expensive to find, whatev)

However, a contact cannot be stored in the database if not linked to a company because:

  • reaching out to a client company without noticing is too risky (if we don’t have links between contacts and companies, how could we possibly notice that a contact works at a client company?)

  • even though we talk to humans, at the end of the day, companies are paying (at least, in b2b)

  • and, to be honest, once you have a contact, you usually have their company info as well.

That’s what this part of the flow chart means:

So, the contacts table will necessarily include a company identifier as a foreign key.

Once again, if possible, i strongly advise using LinkedIn ids, both as primary keys in each table, and as foreign keys to link tables together.
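
On the placeholder schema, that constraint would look something like this (nothing, on the other hand, forces a company to have contacts):

CREATE TABLE contacts (
	linkedin_url        TEXT PRIMARY KEY, -- unique identifier of the contact
	full_name           TEXT,
	email               TEXT,
	company_linkedin_id TEXT NOT NULL REFERENCES companies (linkedin_id) -- a contact cannot exist without its company
);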

3. Iterations

Each time you’re thinking about adding a new type of intent signal to the mix, here’s what you should ask yourself:

  1. What’s the data source for that new intent signal?

  2. What data extractor are we going to use?

  3. What data am i going to get from the extractor?

  4. Of this data, what will i use to build my signals’ primary keys?

  5. Similarly, which data will i use as foreign keys to link signals to contacts and/or companies?

  6. What will be my signal’s data schema (i.e. what fields of what types)?

  7. Then (and only then) will you ask yourself: how to implement the workflow to store the signal in my database?

This roadmap has a major implication you must not miss: each type of signal gets its own table in the database.

Why? Isn’t it slightly over-engineered?

Well, here are your options:

Click on the three links above: it will be much easier to understand what i mean, and why the third option (i.e. one table per signal) is the best:

  • we store data with utmost granularity

  • all fields are strongly typed

  • by limiting the number of fields, we also limit the number of empty fields

  • and, finally, we can add/delete/edit any type of signal without impacting other types of signals.
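
Concretely, two signal types would give two tables whose columns have nothing in common besides the identifiers (again a sketch with made-up fields, borrowing two of the signals we’ll cover next week):

CREATE TABLE big_headcount_variation_signals (
	signal_id           TEXT PRIMARY KEY,
	company_linkedin_id TEXT REFERENCES companies (linkedin_id), -- emitted by a company
	headcount_before    INTEGER, -- fields specific to this signal type
	headcount_after     INTEGER,
	detection_date      DATE
);

CREATE TABLE negative_lemlist_interaction_signals (
	signal_id            TEXT PRIMARY KEY,
	contact_linkedin_url TEXT REFERENCES contacts (linkedin_url), -- emitted by a contact
	campaign_name        TEXT, -- completely different fields, still strongly typed
	interaction_type     TEXT,
	detection_date       DATE
);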

TL;DR

  • the (total?) addressable market is at the centre of the database, split across two tables:

    • one for companies

    • another for contacts

  • you must define that addressable market with precise logical criteria to ensure your database does not overflow after two weeks

  • to each intent signal type its own table, either linked to the companies or the contacts table

  • if you want to store a signal whose emitting company or contact does not exist in the database yet, you must create either of them alongside the signal

  • which means that stored signals have an impact on the size of your contacts and companies tables

  • each record of each table contains one or multiple identifiers to:

    • avoid creating duplicates in the table (primary key)

    • link it to records of other tables (foreign key)

  • there’s a one-to-many relationship between companies and contacts:

    • a contact is necessarily linked to a company

    • but a company is linked to between 0 and an infinity of contacts

  • there’s a many-to-one relationship between the table of any signal and the companies or contacts table (cf. the green dotted lines on the flow chart)

Next week…

That’s it. You know everything. I’ve laid out the whole theory behind building a robust and sustainable intent signal detection system.

Next week, we will put that theory to practice:

  • i’ll introduce Cargo, the tool we used with Bulldozer (a collective/agency) during an ABM mission

  • i’ll explain how my teammate defined the addressable market

  • i’ll unfold the blueprint on three intent signals:

    • Negative Lemlist Interaction

    • Big Headcount Variation

    • Website Visit

Expect some beautiful charts:

Dopamine shop

Cheers ✌️

Bastien.
