Unlocking our Data Strategy at Thomas with Databricks

The TL;DR:

Data is central to our vision at Thomas — we need to be able to understand individuals accurately, provide accessible insights at all levels, and give actionable pathways to optimize interactions in organisations.
We needed a datastack that grew and scaled as fast as the ideas and experiments we had. Databricks revealed itself to be the obvious choice due to its lakehouse functionality, effortlessly underpinning our business intelligence tooling, as well as providing an environment for us to build cutting-edge data science solutions for our customers.
We had to move fast, and Databricks’ accessibility and ease of use made this a dream to do, allowing us to migrate away from Snowflake and onto Databricks within 4 weeks.
Databricks is now the heart of our data infrastructure: we have unlocked the ability to work with real-time data, we have been able to democratize data by getting it into the hands of the people who need it across the business, and we are sprinting forward with experiments and innovations with its data science functionality.
Where are we going next? We want to continue to use Databricks to innovate around applying machine learning and AI to understand our customers’ behavioural patterns and to help them make the most of their everyday interactions at work. We are also excited about stepping into the world of Generative AI and experimenting with product features like Dolly to take our product to the next level.

Setting the scene: Who are Thomas and what role does data play in our vision?

Thomas is a leading global talent assessment platform, with over 40 years of experience in combining psychological science and people analytics to help CXOs, team leaders, and individuals get the most out of every interaction that they have in the workplace. At the core of what we do is providing smart interaction guidance for everyone, at every level, in every role and in every team — whether this is helping individuals understand how to collaborate with their peers, team leaders to coach and develop their teams, or business leaders in building an inspirational culture that drives engagement, wellbeing, and productivity.

Thomas began its digital transformation journey in early 2018, when we made the decision to transform our consultancy-based business model and invest in building a truly self-serve SaaS platform that any individual, team leader, or CXO could get instant value from.

Data still sits at the heart of this vision. We recognised early in this journey that to be able to deliver an indispensable experience for our customers, we needed to understand what was working and what wasn’t from the data. As a result, we moved from batch processing data to working with real-time events and have increased the range of data types captured in our platform by 400%, as a conservative estimate. We are beginning to learn more and more about individuals: what they value about interactions, what drives effective and potentially dysfunctional interactions, and how we can help people unlock their full potential at work.

This massive increase in the breadth and depth of data we captured presented vast potential for us to deliver our vision, but we needed a data stack that allowed us to fully and effectively execute our data strategy.

Why did we look at Databricks? Evaluating our current datastack against what we needed for the future

We started building the foundations of our Data Science team and capabilities roughly three years into our digital transformation journey. Our remit was broad; we were responsible for development of net-new machine learning and AI features, product analytics, cross-business BI reporting, as well as building and maintaining our data pipelines and warehouses.

Our datastack at the time was a good reporting and BI engine; it let us get answers quickly from the data we had into dashboards or Excel exports. But we had bigger ambitions for our data strategy and needed a data stack that would let us reach them. Our biggest problems with our datastack at the time were:

A Home for Data Science: As a must-have, we needed a dedicated space to test, deploy, and monitor our machine learning models. As a nice-to-have, we wanted a space whose built-in functionality would jump-start our data science ideas and let us hit the ground running.
Working in Real-Time: We had invested heavily in building event streams in our new platform, which meant we had a live feed of data coming into our data storage. We needed a data stack that let us work with this data without bankrupting the business.
Going Beyond SQL: SQL is great, but our team wanted to leverage their expertise in Python and R to be able to get the most out of our data. We wanted a space that let us go beyond the standard SQL and let us unleash the potential of our team in all the languages that they work in.
Data for the many, not the few: the different elements of our data stack were not well integrated, with different levels of access, licensing, and permissions for each of them. We needed something that made our infrastructure accessible and secure by allowing it to work closely with our cloud platform.

It quickly became apparent that our data stack at the time did not have what we needed for us to build our data strategy at Thomas. We evaluated a lot of different options to try and find the one that made the most sense to us. Databricks was head-and-shoulders above the pack, whilst also being a cost-neutral solution for us to migrate to. Databricks showcased its ability to deliver value parity on what we had at the time, provide us with our list of ‘must haves’ in a data stack of the future, as well as a host of other features we did not know we wanted or needed at the time.

For instance, we were able to take large strides in improving our data resiliency across the entire infrastructure; we understood more about what was happening with our data and where. We spent less time, money, and resource moving data between systems and instead leveraged our data lake to work with our data holistically in one platform. We are a small data team, and having all our data in our data lakehouse as a single source of truth, rather than scattered across different platforms and data sources, has meant we have been able to identify and resolve problems more effectively and efficiently than ever before. Data problems that previously took days of team time to fully investigate and resolve now take us hours.

The Migration: Moving from Snowflake to Databricks

With Databricks identified as our ideal provider and a comprehensive, convincing consumption plan put together with our account manager and solutions architect, we still realized we had to move quickly. We wanted to keep the migration as cost-neutral as possible, which all depended on us running two data systems side by side for as little time as possible.

This left us with 4 weeks to fully migrate our data infrastructure from Snowflake and on to Databricks. We would not recommend doing it this quickly, but our tale shows that it is very possible to migrate and get Databricks running this fast!

There were a couple of core factors that made this migration as quick and effective as it was that we would recommend considering, regardless of your timelines:

Streamlining with the Pareto Principle: We had a vast number of pipelines and tables in our infrastructure, most of which were not doing very much (if anything). We used this opportunity to identify the 20% of pipelines and tables that were driving 80% of the value in our data infrastructure. We saw an opportunity to be brutal in simplifying our data infrastructure and re-routing users to our new centralized, streamlined source of data. This is both a clean-up and a change-management opportunity, and one that does not often present itself as neatly as this. Take advantage of it to review the different pipelines, pathways, and processes in your data infrastructure and build something simple, streamlined, and scalable.
Databricks is ridiculously easy to use: We were able to set up secure connections with Terraform, migrate our pipelines quickly, and transfer our data products to the new system without any hassle. We did all of this in Python and SQL, the languages our teams are most comfortable with!
Partner Connect as a huge time saver: Our BI tool, ThoughtSpot, plugged in seamlessly to our newly built lakehouse thanks to the Partner Connect functionality. This was a dream. We were able to unplug the platform from Snowflake and connect to a dedicated schema in Databricks with no outage for the rest of the business (and without anyone really noticing). The only comments we received were “ThoughtSpot is running a lot quicker. Have you done anything?”
The Databricks Team: the team at Databricks (massive shout out to Anna and Ollie!) offered huge amounts of support and guidance on best practice implementations so that we could hit the ground running and get it all migrated with time to spare. Before we had started, we had a plan in place. We knew where we might get stuck, what would be easier and harder, and how to tackle all of these problems before we had even started. It would not have been possible without their support!
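
A minimal sketch of how a Pareto-style review of tables could be approached, assuming you can export a usage count (for example, queries per table over a recent period) from your platform. The table names and counts below are made up for illustration, and `pareto_core` is a hypothetical helper, not part of Databricks or of the migration described here.

```python
from typing import Dict, List


def pareto_core(usage: Dict[str, int], coverage: float = 0.8) -> List[str]:
    """Return the most-used tables whose combined usage reaches `coverage` of the total.

    `usage` maps a table (or pipeline) name to a usage count.
    """
    total = sum(usage.values())
    if total == 0:
        return []

    core: List[str] = []
    running = 0
    # Walk the tables from most used to least used, stopping once the
    # tables already selected account for the requested share of usage.
    for name, count in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
        if running >= coverage * total:
            break
        core.append(name)
        running += count
    return core


# Hypothetical query counts per table (illustrative numbers only).
usage = {
    "events_clean": 520,
    "assessments": 300,
    "legacy_export": 80,
    "tmp_backup": 60,
    "old_report_v1": 25,
    "scratch": 15,
}

core = pareto_core(usage)
candidates_to_retire = [name for name in usage if name not in core]
print("Migrate first:", core)
print("Review before migrating:", candidates_to_retire)
```

Usage counts are only one proxy for value; a table that is queried rarely but feeds a critical report would still need a manual check before being retired.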

We ended up finishing with time to spare, which allowed us to properly go back and review what we had implemented, stress-test it, and feel confident when we finally turned off Snowflake.

What we have unlocked: from working with real-time data to machine learning solutions

The result of our migration was that Databricks became the heart of our data infrastructure. Our data ecosystem is now far simpler and easier to follow, with the Databricks Lakehouse sitting as our single, unified data platform for our data needs. Here are some of the key value points we have unlocked within the first three months of having Databricks implemented:

Getting Data-Driven Insights Faster: We are cooking with gas! (aka working with real-time data) — no more waiting a day for the latest usage metrics to come in. It’s fresh off the press and we can see how our customers are using our platform within seconds of them engaging with our product.
Scaling Data Insights to the Entire Business: Databricks underpins all our self-serve BI across the business, as well as for our partner network. The lakehouse is delivering a faster, more efficient experience for all our end users, with data insights being delivered ~40% faster than with our previous datastack. Additionally, we are able to implement fixes and new solutions far quicker as a result of Databricks, freeing up ~20% of our Data Science team’s time to focus on innovation and experimentation!
Empowering Data Science: Our Data Science team is sprinting with experiments and innovations — from customer segmentation models and prototyping predictive analytics all the way through to fuzzy matching algorithms and social network analysis. We are unlocking more insights about what value our customers get from our platform as well as developing innovative ways to solve problems for them through machine learning and AI.
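
To give a flavour of what a fuzzy matching experiment can look like, here is a minimal sketch that uses Python's standard-library `difflib` as a stand-in. It is not the approach or the data our team uses; the company names, the `best_match` helper, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def similarity(a: str, b: str) -> float:
    """Similarity score between 0 and 1, ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def best_match(name: str, candidates: Iterable[str], threshold: float = 0.8) -> Optional[str]:
    """Return the candidate most similar to `name`, or None if nothing clears the threshold."""
    scored = [(similarity(name, candidate), candidate) for candidate in candidates]
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None


# Hypothetical account names as they might appear in two different systems.
known_accounts = ["ACME Ltd", "Apex Limited", "Beacon Group"]

print(best_match("Acme Ltd.", known_accounts))    # close spelling variant of a known account
print(best_match("Zenith Corp", known_accounts))  # no sufficiently similar account
```

The threshold trades false matches against missed ones, so in practice it would need tuning against a sample of records that have been matched by hand.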

Where do we go from here? Databricks and our data strategy going forward

We have taken the first steps with our data strategy and have built a strong foundation with Databricks. We now want to take this further and go even faster. There are two areas where we are looking next to scale:

Making the Lakehouse the center of our universe;
Innovations and experiments in how individuals and groups optimize their interactions at work.

By making the Lakehouse the center of our universe, we want to bring all of our data into a single, unified platform to continue to drive alignment, simplification, and scalability for our data. We have taken huge strides in getting multiple sources of data into the Lakehouse, but there is more we can do. Our next focus is to build connections and pipelines that bring the last few of our data systems (e.g. our finance system and a few CRMs) into the Lakehouse. This will unlock huge value for us by creating that single source of truth, aligning the business on metric definitions, and breaking down the last silos in our data, so that we can focus on the implications of the data rather than on which source of data is most accurate.

Central to our data strategy is the ability to be at the leading edge of how we understand and optimize the interactions that we have in the workplace. Our Data Science team have built an R&D roadmap with four key tracks we want to explore:

How customers search for value: Our platform takes complex psychological insights about individuals and ‘translates’ them into accessible, actionable content for the end user. We want to go even further with this, and Large Language Models (LLMs) are where we are focusing next. Databricks has just released Dolly, an open-source, instruction-following LLM. We are educating ourselves on everything Dolly and are looking to fully leverage this new technology to help our customers find the valuable content they need faster and more effectively.
Recommendations & “Next Best Action”: To help people optimise their interactions, we need to make accurate and accessible recommendations on what actions they can take. As we learn how new interventions are having an impact, we want our machine learning models to continuously scale and learn with each new feature we produce.
Segmenting our Customer Behaviour: Customer segmentation as a concept is not new, but we are seeing a lot of innovative ways that it is being used by leading organisations outside of our industry. Through a combination of Databricks’ Solution Accelerators and our own models, we are experimenting and innovating around how we segment and classify our customers’ behaviour to better understand how to tailor the product experience to them.
Measuring New Things in New Ways: We want to experiment with natural language processing and latent semantic analysis models to continue innovating around how we can best help individuals express their true, authentic selves in the workplace and get the most out of every interaction.
