Mapping ~400k speeches from the Swedish parliament

data-science
embeddings
Published

July 30, 2024

This post explores the open data published by the Swedish parliament. They have done an amazing job classifying the data and making it accessible through an API. The post is inspired by Umar Butler’s post Mapping (almost) every law, regulation and case in Australia, which explains every step along the way very thoroughly.

The data available from the Swedish parliament is huge, so this post only explores the speeches held in parliament. There is plenty of other data, such as voting results and the drafting of laws.

My plan is to write this as a series divided into the following parts:

  1. Mapping the speeches in 2D using embeddings.
  2. Clustering and labeling the speeches.
  3. Analysis to try to answer:
    1. Have the parties in Sweden moved closer to Sverigedemokraterna on immigration over the last 3 years, if you use embeddings as a measuring tool?
    2. Has the positioning of the parties on NATO changed since Russia invaded Ukraine, if you use embeddings as a measuring tool?

Note that I am not an academic in the social sciences. I do this because it interests me. This is not actual science. But if you do actual science and you have feedback, don’t hesitate to contact me!

Data gathering

In my case this is the easy part since the data is ready to download. I used my MacBook from 2017 with Python 3.11 and SQLite. Of all the data that is available I chose the speeches because this was the smallest dataset. I figured it would cost too much to get the embeddings for the other data.

The first step was to read the API documentation in order to understand how to enumerate all the API calls. Then I downloaded all the CSV files containing the list of speeches from 1993-2024. I loaded them into SQLite, and from there I could enumerate the API calls. The actual scraping was done with scrapy and the content was saved in a SQLite database. Each response was a JSON document with the following data:

{'anforande': {'anforande_id': 'c0d41037-1ffd-ee11-87dd-6805cad9744d',
               'anforande_nummer': '10',
               'anforandetext': '<p>Herr talman! Jag tackar för anförandet av '
                                'Yasmine Bladelius.</p><p>Jag noterade i '
                                'anförandet att man var glad över att '
                                'lagförslaget nu ligger på bordet för votering '
                                'i eftermiddag och att en stor majoritet av '
                                'riksdagen och av de berörda '
                                'lobbyorganisationerna är överens och positiva '
                                'till förslaget. Det man kanske missar att '
                                'säga, som framgår av tidigare replikskiften, '
                                'när man målar upp att det är '
                                'Sverigedemokraterna och Kristdemokraterna som '
                                'är skeptiska till förslaget är att det inte '
                                'bara är vi som är skeptiska. Yasmine kan ha '
                                'missat debatten de senaste månaderna, men det '
                                'är en lång rad från professionen inom '
                                'läkarvården som är djupt kritiska. Det finns '
                                'en remisslista med en rad olika '
                                'remissinstanser som har lyft upp en viss '
                                'skepsis om förslaget, och jag instämmer i '
                                'deras försiktighetsprincip. Det gäller även '
                                'Svenska Bankföreningen, Skatteverket, '
                                'Socialstyrelsen, Arbetsförmedlingen, '
                                'Kronofogdemyndigheten, Polismyndigheten, '
                                'Åklagarmyndigheten, Ekobrottsmyndigheten och '
                                'så vidare.</p><p>Demoskops senaste mätning '
                                'visar att endast 20\xa0procent av Sveriges '
                                'befolkning är positiva. Inte minst inom '
                                'Socialdemokraternas egna väljarled är en '
                                'majoritet emot förslaget. Det kan vara väl '
                                'värt att uppmärksamma de som lyssnar på '
                                'debatten att det inte är Sverigedemokraterna '
                                'och Kristdemokraterna allena som är '
                                'kritiska.</p><p>Jag har naturligtvis också '
                                'haft diskussionen hemma vid bordet med vänner '
                                'och andra i min bekantskapskrets, och jag har '
                                'lovat att ställa en fråga till '
                                'Socialdemokraterna från de kvinnor jag har '
                                'pratat med, nu när vi ändå står här. Vad är '
                                'det att vara en kvinna, när man vill förenkla '
                                'och ska kunna identifiera sig hur man '
                                'vill?</p><p>(Applåder)</p>',
               'avsnittsrubrik': 'Förbättrade möjligheter att ändra kön',
               'dok_datum': '2024-04-17 00:00:00',
               'dok_hangar_id': '5205787',
               'dok_id': 'HB09100',
               'dok_nummer': '100',
               'dok_rm': '2023/24',
               'dok_titel': 'Protokoll 2023/24:100 Onsdagen den 17 april',
               'intressent_id': '0279488554424',
               'kammaraktivitet': 'ärendedebatt',
               'parti': 'SD',
               'protokoll_url_www': 'http://www.riksdagen.se/sv/Dokument-Lagar/Kammaren/Protokoll/Riksdagens-snabbprotokoll_HB09100/#anf10',
               'rel_dok_id': 'HB01SoU22',
               'replik': 'Y',
               'systemdatum': '2024-04-18 03:02:17',
               'systemnyckel': '604815',
               'talare': 'Alexander Christiansson (SD)',
               'underrubrik': ''}}

This is all in Swedish. What we are looking for is the key anforande, which means speech; the speech text itself sits in anforandetext as HTML. The other fields are metadata about the speech, such as who gave it, their political party, etc. The metadata will come in handy for our analysis.
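To make the shape of the response concrete, here is a minimal sketch of how the speech text could be pulled out of one response and stored next to a few metadata fields. It is not the code I ran: the table name speeches and its column layout are assumptions made for illustration. Only the JSON keys (anforande, anforandetext, anforande_id, parti, dok_datum) come from the response shown above.

```python
import sqlite3
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and treats each closing </p> as a paragraph break."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append("\n\n")


def speech_text(response: dict) -> str:
    """Return the plain text of a speech from one API response."""
    html = response["anforande"]["anforandetext"] or ""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Replace non-breaking spaces (\xa0) and trim the trailing paragraph break.
    return "".join(parser.parts).replace("\xa0", " ").strip()


def store(conn: sqlite3.Connection, response: dict) -> None:
    """Insert one speech into a hypothetical 'speeches' table."""
    a = response["anforande"]
    conn.execute(
        "INSERT OR REPLACE INTO speeches (anforande_id, parti, dok_datum, text) "
        "VALUES (?, ?, ?, ?)",
        (a["anforande_id"], a["parti"], a["dok_datum"], speech_text(response)),
    )


# Usage with an in-memory database and a shortened response.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE speeches ("
    "anforande_id TEXT PRIMARY KEY, parti TEXT, dok_datum TEXT, text TEXT)"
)
example = {
    "anforande": {
        "anforande_id": "c0d41037-1ffd-ee11-87dd-6805cad9744d",
        "anforandetext": "<p>Herr talman!</p><p>Endast 20\xa0procent.</p>",
        "parti": "SD",
        "dok_datum": "2024-04-17 00:00:00",
    }
}
store(conn, example)
```

Keeping the paragraph breaks is a design choice: it leaves the option open to embed per paragraph later instead of per speech.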

Creating embeddings

This part took some trial and error because of the volume and the language. There are approx. 400k speeches, which, depending on the tokenizer you use, yields roughly 300M tokens. For my laptop and wallet that’s a lot of tokens. I first tried the OpenAI API, but it didn’t take long before I hit my rate limit on API calls. Then I tried batch calls, but the limit was 3 M tokens/day for my tier.

Then I stumbled across Claude and saw that they use VoyageAI for embeddings. Voyage has better rate limits: 1 M tokens/min and 300 calls/min. They had also just released a new multilingual embedding model. I still wanted to understand how that model compared to OpenAI’s. I was fortunate enough to find Kenneth Enevoldsen’s scandinavian-embedding-benchmark. I opened a request to add voyage-multilingual-2 and Kenneth was kind enough to add the model.

Now, I am not an expert, but at first glance Voyage seemed like a good fit. The average score was lower than OpenAI’s, but I guess that’s because of the language-recognition benchmark, which is not my use case. Besides, Voyage has 1024 dimensions while the OpenAI (large) model has over 3 000. That should speed up the dimensionality reduction that I run locally.
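With limits on both tokens and calls per minute, the speeches have to be sent in batches. Below is a rough sketch of greedy batching under a token budget. Everything in it is an assumption for illustration: the whitespace token counter is a crude stand-in for the provider’s tokenizer, and the per-batch limits are placeholders, not Voyage’s documented values.

```python
from typing import Callable, Iterable, Iterator


def batch_by_tokens(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens_per_batch: int,
    max_texts_per_batch: int,
) -> Iterator[list[str]]:
    """Greedily group texts so each batch stays within both limits.

    A single text larger than max_tokens_per_batch is still emitted,
    alone in its own batch, so the caller can decide how to truncate it.
    """
    batch: list[str] = []
    used = 0
    for text in texts:
        n = count_tokens(text)
        full = used + n > max_tokens_per_batch or len(batch) >= max_texts_per_batch
        if batch and full:
            yield batch
            batch, used = [], 0
        batch.append(text)
        used += n
    if batch:
        yield batch


def min_seconds_between_calls(calls_per_minute: int) -> float:
    """Spacing between requests that keeps us under a calls/min limit."""
    return 60.0 / calls_per_minute


# Usage with a crude whitespace "tokenizer" (an assumption, not a real one).
def rough_count(text: str) -> int:
    return len(text.split())


batches = list(
    batch_by_tokens(
        ["a b", "c d e", "f", "g h i j"],
        rough_count,
        max_tokens_per_batch=4,
        max_texts_per_batch=2,
    )
)
pause = min_seconds_between_calls(300)  # 300 calls/min, the limit quoted above
```

In a real run each batch would go to the embedding API, with a sleep of at least pause seconds between requests.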
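To get a feel for what the dimension count means locally, here is a back-of-the-envelope calculation of the size of the embedding matrix. It assumes the vectors are stored as 32-bit floats and that “over 3 000” means 3072 dimensions; both are assumptions on my part.

```python
N_SPEECHES = 400_000      # approximate number of speeches
BYTES_PER_FLOAT32 = 4     # assumption: embeddings stored as float32


def matrix_gb(n_rows: int, n_dims: int, bytes_per_value: int = BYTES_PER_FLOAT32) -> float:
    """Size of a dense n_rows x n_dims matrix in gigabytes (10^9 bytes)."""
    return n_rows * n_dims * bytes_per_value / 1e9


voyage_gb = matrix_gb(N_SPEECHES, 1024)   # about 1.6 GB
openai_gb = matrix_gb(N_SPEECHES, 3072)   # about 4.9 GB (assumed 3072 dims)

print(f"1024 dims: {voyage_gb:.1f} GB, 3072 dims: {openai_gb:.1f} GB")
```

On a 2017 laptop the smaller matrix is much more comfortable to keep in memory during dimensionality reduction.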

All in all, it took about 6 hours to get the embeddings for all the content and cost about 30 USD.
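As a sanity check, the numbers above hang together. The arithmetic below uses only figures quoted in this post (roughly 300M tokens, 3 M tokens/day on the OpenAI batch tier, 1 M tokens/min at Voyage, about 30 USD in total). The per-token price is derived from those figures, not taken from a price list.

```python
TOTAL_TOKENS = 300_000_000            # rough token count for all speeches
OPENAI_BATCH_TOKENS_PER_DAY = 3_000_000
VOYAGE_TOKENS_PER_MIN = 1_000_000
TOTAL_COST_USD = 30

# How long the OpenAI batch limit would have taken.
openai_days = TOTAL_TOKENS / OPENAI_BATCH_TOKENS_PER_DAY        # 100 days

# Lower bound on wall-clock time at the Voyage token limit.
voyage_hours = TOTAL_TOKENS / VOYAGE_TOKENS_PER_MIN / 60        # 5 hours

# Implied price per million tokens.
usd_per_million = TOTAL_COST_USD / (TOTAL_TOKENS / 1_000_000)   # 0.10 USD

print(f"OpenAI batch tier: {openai_days:.0f} days")
print(f"Voyage rate limit: at least {voyage_hours:.0f} hours")
print(f"Implied price: {usd_per_million:.2f} USD per million tokens")
```

The 5 hour lower bound fits the roughly 6 hours it actually took, since real requests never saturate the rate limit perfectly.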

Dimensionality reduction and plotting

I just followed Umar Butler’s post and chose the PaCMAP algorithm for dimensionality reduction, going from 1024 to 2 dimensions. It took about 75 minutes to run on my laptop. For plotting I went with plotly. You can see the result below: