The aim of this post is to…
- … highlight an unsolved problem: It is difficult to get an overview of the skills currently demanded by the job market
- … share the lessons learned from my attempt to achieve such an overview
The problem
Whenever we choose a new education program (or a new job as a young professional), we take the job market's demand into account:
How relevant will my newly acquired skills be? What jobs could I get with them?
Arguably, we should even start by looking at potential future jobs, check their required skills, and only then pick a suitable education or job in which to acquire those skills. However, getting an overview of the skills demanded by the job market is difficult:
- Most education providers (e.g. universities) have a section titled “With this education you can…”. While this may give you an idea of potential future jobs, such information is unreliable: both private and public education providers are understandably interested in painting a bright picture of their programs. Moreover, when designing those programs, they might lean more towards attracting students than towards targeting job market needs. After all, their primary customers are the students, not the companies. A friend of mine experienced an extreme case: she completed a new Master’s degree program on an exciting new topic at a private international university. After graduation, however, only about 10% of the graduates were able to find a job, because their mixture of acquired skills simply did not fit any job posting. In short, the education provider should not be your only source of information.
- As a second option, you could check individual job postings. While this is a reliable source of information for individual jobs, it is difficult to get the big picture this way. How many other jobs have similar skill demands? Are there typical skill profiles? Am I at a crossroads where I should decide to become an expert in either skill A or skill B? Analyzing individual job postings is not a good way to answer such questions: you can manually analyze dozens of postings, but not thousands of them in a reasonable amount of time. And even if you could, you would still need to structure that raw information into a more useful representation, which is easier said than done.
Data Science to the rescue
Since analyzing individual job postings manually seems futile, can we do it automatically? The sections below describe my attempt, which consists of
- Data collection and preparation
- Extracting word embeddings
- Graphical representation
1 Data collection and preparation
As the data source, I chose a German job platform that I have often used myself. I downloaded approximately 10,000 job postings as plain text files using the Python library Scrapy.
Since web scraping can be harmful or even illegal, please keep in mind the guidelines recommended in the following two links:
- https://towardsdatascience.com/is-web-crawling-legal-a758c8fcacde
- https://softwareengineering.stackexchange.com/questions/91760/how-to-be-a-good-citizen-when-crawling-web-sites
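For illustration, a minimal Scrapy spider along these lines could look like the sketch below. The start URL and CSS selectors are placeholders rather than the ones I actually used, and writing each posting straight to a text file is a simplification of Scrapy's usual item pipeline:

import scrapy

class JobPostingSpider(scrapy.Spider):
    name = "job_postings"
    # placeholder listing URL; the real platform is deliberately not named here
    start_urls = ["https://example-job-platform.de/jobs?page=1"]
    custom_settings = {
        "ROBOTSTXT_OBEY": True,  # respect robots.txt, as the links above recommend
        "DOWNLOAD_DELAY": 2,     # throttle requests to be a polite crawler
    }

    def parse(self, response):
        # follow every link to an individual job posting (placeholder selector)
        for href in response.css("a.job-posting::attr(href)").getall():
            yield response.follow(href, callback=self.parse_posting)

    def parse_posting(self, response):
        # collect the visible text of the posting (placeholder selector)
        text = " ".join(response.css("div.job-description ::text").getall())
        filename = response.url.rstrip("/").split("/")[-1] + ".txt"
        with open(filename, "w", encoding="utf-8") as f:
            f.write(text)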
The job postings contained not only the required skills, but also information about the company, the advertised position, etc. Therefore, we need to do some post-processing to extract the relevant part. Luckily, the raw text was already divided into sections with one heading per section, so we can use a few heuristics to find the section containing the required skills (a minimal sketch of such a heuristic follows the example below). An example file looks like this:
You are our ideal candidate if you have...
- Excellent coding skills in PHP7 of minimum 5 years
- Practical experience with Frameworks like Laravel and Zend
- Solid understanding of RESTful API’s
- Comfortable working in an English-speaking & agile environment
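As a rough sketch of such a heuristic, the snippet below treats short lines without bullet markers as headings and picks the first section whose heading matches a skills-related keyword. The keyword list and the length threshold are assumptions for illustration, not the exact rules I used:

import re

# keywords (German and English) that typically introduce the skills section
SKILL_HEADINGS = re.compile(
    r"ideal candidate|your profile|requirements|qualifikation|"
    r"dein profil|ihr profil|was du mitbringst|wir erwarten",
    re.IGNORECASE,
)

def extract_skill_section(raw_text: str) -> str:
    """Return the text between a skills-like heading and the next heading."""
    sections, current = [], []
    for line in raw_text.splitlines():
        stripped = line.strip()
        # heuristic: a short, non-bullet line is treated as a section heading
        if stripped and len(stripped) < 60 and not stripped.startswith("-"):
            if current:
                sections.append(current)
            current = [line]
        else:
            current.append(line)
    if current:
        sections.append(current)
    # return the body of the first section with a skills-like heading
    for section in sections:
        if SKILL_HEADINGS.search(section[0]):
            return "\n".join(section[1:]).strip()
    return ""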
2 Extracting word embeddings
In order to cluster the skills, we need to somehow calculate the similarity between the texts of two skill sections. Based on https://medium.com/@adriensieg/text-similarities-da019229c894, BERT embeddings seem like a great tool for this purpose. BERT is a transformer-based natural language processing model developed by Google. When we feed it a sentence, it computes a 768-dimensional embedding vector for each word. In contrast to e.g. word2vec models, the vector for each word is not constant but depends on the context of the sentence.
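To make the notion of "similarity" concrete: once each skill section has been turned into a vector, two sections can be compared via the cosine similarity of their vectors. A tiny sketch with two hypothetical vectors a and b:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the vectors point in the same direction, 0.0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))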
Google provides pretrained BERT models for English. For German, deepset.ai provides a pretrained model and even a framework called FARM to simplify transfer learning with BERT. Let’s use it. Computing the word embeddings is then as simple as
from farm.infer import Inferencer

# load the pretrained German BERT model for embedding extraction
model = Inferencer.load(
    model_name_or_path="bert-base-german-cased",
    task_type="embeddings",
    gpu=True,
    batch_size=32,
    extraction_strategy="per_token",
    extraction_layer=-2,
)
example_sentence = "Hello world!"
embedding = model.inference_from_dicts([{"text": example_sentence}])
In this way, we can convert each word in a job posting's skill section into a 768-dimensional vector. These word vectors can simply be averaged, so that each job posting is represented by a single 768-dimensional vector.
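A minimal sketch of this averaging step, assuming the raw skill sections are stored in a list called skill_sections and that FARM returns the per-token vectors under the key "vec" (both assumptions, to be adapted to your FARM version):

import numpy as np

def section_to_vector(skill_text: str) -> np.ndarray:
    # embed one skill section and average its per-token vectors
    result = model.inference_from_dicts([{"text": skill_text}])
    token_vectors = np.array(result[0]["vec"])  # shape: (num_tokens, 768)
    return token_vectors.mean(axis=0)           # shape: (768,)

# stack all skill sections into one matrix for the next step
x = np.stack([section_to_vector(text) for text in skill_sections])  # (10000, 768)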
3 Graphical representation
Finally, we can cluster our job postings (or, more precisely, their skill sections). To visually identify different clusters, we will plot the points in two dimensions using the UMAP dimensionality reduction technique, which is similar to the more widely known t-SNE but more computationally efficient. The usage is very simple:
# x is our (10000, 768) embedding matrix as an np.ndarray
from umap import UMAP

umap_2d = UMAP(random_state=0)
umap_2d.fit(x)
projections = umap_2d.transform(x)  # has shape (10000, 2)
Let’s plot the results!
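For reference, a minimal sketch of how such a plot can be produced with Plotly Express (assuming the projections array from above):

import plotly.express as px

fig = px.scatter(
    x=projections[:, 0],
    y=projections[:, 1],
    opacity=0.3,
    title="UMAP projection of job-posting skill sections",
)
fig.show()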
This looks extremely promising: we can clearly identify 4–5 clusters. Maybe they represent different skill areas such as IT, engineering, economics, and social work?
Let us manually analyze the points within a cluster using the extremely useful hover feature of Plotly Express. After some fine-tuning, we can easily read the raw text for each point in a cluster:
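A sketch of the hover idea: attach the (truncated) raw skill text to each point via hover_name, again assuming the raw sections are kept in the hypothetical list skill_sections:

hover_text = [text[:200] for text in skill_sections]
fig = px.scatter(
    x=projections[:, 0],
    y=projections[:, 1],
    hover_name=hover_text,
    opacity=0.3,
)
fig.show()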
And it quickly turns out: the clusters do not actually represent similar skill profiles. Instead, they represent similar writing styles! More precisely, the individual clusters contain skill sections which…
- … are written in English (the majority is written in German)
- … use “Sie”, the German formal word for “you”
- … use “Du”, the German informal word for “you”
- … contain one or more blank lines
- … form the “rest” (a rather uniform distribution without obvious subclusters)
Conclusion and future work
Obviously, this result is not what I expected, nor does it tackle the problem stated at the beginning. But I hope it is still instructive.
The fact that this approach didn’t work doesn’t mean the idea is futile: only last week, Amamou Walid proposed a similar approach here on Medium, see https://medium.com/mlearning-ai/building-a-knowledge-graph-for-job-search-using-bert-transformer-8677c8b3a2e7. While he also uses BERT, he applies named-entity recognition to extract very specific information from job postings, such as the required major and the years of experience for each required skill. This achieved great results for job postings by Facebook, Google, Microsoft, IBM, and Intel. These probably all have a rather similar structure, so I would be extremely curious to see how well the approach works for job postings from a wider variety of companies.
As Károly Zsolnai-Fehér likes to say: Let’s see what “one more paper down the line” will bring!