Feed readers are my learning project; I use them to learn new languages. I've built and rebuilt readers in VBScript, VB.NET, C#, PHP, and Python. PHP and Python have been the easiest since they have good parser libraries. I've also used SQL Server, MySQL, SQLite, and plain JSON flat files. I think I've built something like 10 or so variations. In the last few I've expanded beyond RSS to also pull from Hacker News and Twitter, plus an enhanced pull for Reddit feeds. Though I'm not pulling Twitter currently because of some API changes that I haven't bothered to spend time on.
Helpful hint: if you need favicons for your reader, you can use Google.
https://www.google.com/s2/favicons?domain=techmeme.com
The URL above is a load balancer in front of the URL below, where the t1 subdomain may change to t[1-9]; this second URL also allows you to change the image size.
https://t1.gstatic.com/faviconV2?client=SOCIAL&type=FAVICON&...
I use it to grab and store the 16, 32, 48, and 64 pixel sizes of the icons, with a monthly update ping.
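A tiny sketch of that icon grab. Note the "sz" query parameter here is my assumption about the s2 endpoint (the comment above changes sizes via the gstatic URL instead), so verify it before relying on it:

```python
# Sketch of the favicon grab described above. The "sz" parameter is an
# assumption about Google's s2 endpoint and may differ from the gstatic
# URL's size parameter; treat this as illustrative, not definitive.
from urllib.parse import urlencode

SIZES = (16, 32, 48, 64)

def favicon_url(domain: str, size: int) -> str:
    """Build a Google s2 favicon URL for the given domain and pixel size."""
    query = urlencode({"domain": domain, "sz": size})
    return f"https://www.google.com/s2/favicons?{query}"

urls = [favicon_url("techmeme.com", s) for s in SIZES]
```

In the actual reader you'd fetch each URL on a monthly cron and store the bytes locally, so you're not hitting Google on every page render.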
My current iteration is built in Python with a MySQL backend. It's set up in a river-of-news style with an "everything" river plus one per feed, and I generate topic bundles as well. The feed engine runs every 15 minutes, grabbing 40 feeds at a time, but the static site generator only runs every 6 hours to keep me from spending all my time reading news. Since I pull in Reddit feeds, I've found it's great for feed discovery.
I ran into a number of finicky issues building siftrss[1] a few years back. One I toiled over quite a bit was the discovery that Feedly, a very popular feed reader, does not support gzip. I haven't checked in recent years, but they may still not.
It's frustrating when you're forced to change the behavior of your "agnostic" application for the sake of a large, commonly used third-party tool in the ecosystem.
Where's your own RSS icon then!? ;)
All jokes aside, you just described literally all the points I encountered while developing the built-in feed reader for HeyHomepage.com. Good summary!
One thing I notice a lot of people say - like you - is "forcing users to link through to read the article on the original site (semi defeating the point of subscribing via feedreader)".
I don't really agree, and my own approach focuses explicitly on sending visitors to the original site. I only show the snippet, even when the full content is in the feed. Imagine you did your best for your website and made it nice and shiny; you want people to actually see the site. The original site usually contains more content, like a photo or image, which might also be useful for visitors. Besides, I want webmasters to know I was there by showing up in their visitor statistics (I attach '&rss_ref=heyhomepage.com' to the end of the link to the original site).
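For what it's worth, naively appending '&rss_ref=...' breaks on links that have no query string yet (you'd want '?' there). A small sketch with the standard library that handles both cases; the parameter name comes from the comment above:

```python
# Append a referrer parameter to an article link, working whether or not
# the URL already has a query string. "rss_ref" is the parameter name
# described in the comment above.
from urllib.parse import urlparse, urlunparse, urlencode, parse_qsl

def add_ref(link: str, ref: str = "heyhomepage.com") -> str:
    parts = urlparse(link)
    query = parse_qsl(parts.query)          # existing parameters, if any
    query.append(("rss_ref", ref))          # add the referrer marker last
    return urlunparse(parts._replace(query=urlencode(query)))
```

So `add_ref("https://example.com/post")` yields `...?rss_ref=heyhomepage.com`, while a link that already has `?id=1` gets `&rss_ref=...` appended correctly.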
I'm not saying one way is good and the other bad - there are valid reasons for seeing a feed reader more as an aggregator - but I wanted to point out there are valid reasons for doing the opposite as well.
I also attempted to build a feed reader a while back. In the process I built a feed discovery service:
https://discovery.thirdplace.no/?q=jackevansevo.github.io
It's not perfect, but it's better than simply parsing <link> tags in the HTML.
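For reference, the baseline <link>-tag autodiscovery looks something like the sketch below; a fuller discovery service would also probe common paths like /feed or /rss.xml and sniff the response content type. The MIME-type list here is the usual set, but sites get these wrong constantly:

```python
# Baseline feed autodiscovery: collect <link rel="alternate"> tags whose
# type looks like a feed. Real-world HTML is messier than this assumes.
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/feed+json"}

class FeedLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and a.get("type", "").lower() in FEED_TYPES and a.get("href")):
            self.feeds.append(a["href"])

def discover(html: str) -> list:
    parser = FeedLinkParser()
    parser.feed(html)
    return parser.feeds
```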
An annoying thing with RSS readers is that when a website implements some sort of "security feature", the reader might not be able to download any feeds. I had one occurrence where the feed reader was asked to complete a captcha to reach the content. Being a "bot", it of course failed. Another time a website was blocking all traffic from abroad, so the RSS reader just got access errors, since the server is located in another country.
This is a good list. I did this at a medium scale once (about 10,000 feeds that needed to be checked once per minute).
My favorite thing he mentioned is that the various tags can have different meanings: published, updated, description, content, subtitle. To do this at scale you need per-feed configuration specifying where each piece of information actually lives. Does <published> mean published, or does it actually mean updated? Everyone does it differently.
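One way to express those per-feed overrides is a plain mapping from feed URL to the element that really carries the publication time. The field names below follow feedparser-style keys and the feed URL is made up, so this is only a sketch of the shape:

```python
# Per-feed override table: for feeds whose <published> is really an
# updated timestamp, point "published" at the "updated" field instead.
# The example feed URL is hypothetical.
FIELD_OVERRIDES = {
    "https://example.com/feed.xml": {"published": "updated"},
}

def entry_published(feed_url: str, entry: dict):
    """Return the entry's publication date, honoring per-feed overrides."""
    field = FIELD_OVERRIDES.get(feed_url, {}).get("published", "published")
    return entry.get(field) or entry.get("updated")
```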
And the etag thing. Yeah…
One thing he didn’t mention is media. I think the HN crowd really likes RSS because the mostly-text tech blogs they like to read all support it, and it seems to work fine. But a lot of the population likes to read content that has embedded images and videos. Even slideshows sometimes. There are RSS extensions for this, but they suck for all the same reasons.
At my company we ended up abandoning RSS and writing a customizable web scraper instead (ingesting HTML pages). It was actually a lot easier than dealing with RSS.
I haven't used a feed reader in a long time, but I had a brief period when I was obsessed with Fraidycat. Worth a look if you're interested in a different approach to keeping up with people.
FYI: Another feed reader I built (called pluto with sqlite as feed / data storage) see https://github.com/feedreader - used by OpenStreetMaps Blogs, Planet KDE, and others.
PS: For the (ongoing) struggle (trying) to "normalize" the RSS and ATOM feed formats (or JSON Feeds) see the feedparser gem - https://github.com/rubycocos/feedparser
Been there, done that. A lot of feeds, I mean 99%+, have subtle bugs in their metadata that could easily be fixed; doing so makes feed reader authors' lives easier and broadens your readership. There are RSS validators, please make use of them. I have a lint tool for your blog that cross-checks metadata from the feed against metadata from the post:
A long time back I had a go at this too, but reimplementing ttrss's api instead of writing my own frontend: https://github.com/nvtrss/nvtrss
I learnt a lot. My goal was getting something working that the ttrss android app would connect to and I reasonably succeeded there, running it for a few years.
I went back to hosting the full ttrss application at some point.
This is great! I recognize a lot of the challenges I ran into (or decided to ignore!) when building the reader for https://havenweb.org . I had a particular chuckle at "#just for sorting", remembering feeds that kept bumping themselves to the top of my reader!
What I am missing is a robust solution for keeping my feeds (blogs, podcasts etc) in sync between multiple devices, using a standardised protocol that enables the usage of many different clients on any platform.
There have been some attempts at tackling this problem, but none have managed to get it right and become truly universal, as far as I know.
This is a great read and I will be sure to use this when a project I have in mind needs to parse a variety of feeds. So far the default .NET SyndicationFeed class works well though.
Always wished RSS/ATOM had a dedicated field for images. Why didn’t they? Currently it always seems to involve some inline HTML in a CDATA element. Pretty gross.
I worked on a feed reader back in 2006. The worst feed discovery kluge I can recall needing to special case was that certainly the most popular blog at the time (Cute Overload) was a frameset around blogger. That was typical though, people’s sites are a mess.
Writing my own feed reader was one of my unfinished side projects. Thank you for sharing your struggles.
I'd like to know what issues the author has with ttrss
> Coördinated Universal Time
The o-umlaut doesn't occur in English, and "Nate Hopper" sounds like an English name.
FWIW, I made https://readerize.com that doesn't rely on RSS. Freemium is coming soon, kindly bear with me. For now, signup/trial is free without needing a credit card.
If you don't agree with the philosophy, kindly move along, no need to downvote.
I hesitated for a long time too. One day I just decided to keep at it and launch.
> Including an ETag or Last-Modified header in the body of a request when fetching a feed is a mechanism to tell the server to only return new/modified entries/items (aka: a changeset) since a specific date.
That's not right, is it? The headers are defined at the HTTP level, and with caching layers in between, the endpoint is free to return the current feed, with both new and already-seen entries. Are there many servers optimizing that down to a shorter feed?
At the very least, static blogs will not filter the entries - they're serving / not serving the same file, regardless of etag.
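To make the correction concrete: ETag/Last-Modified only let the server answer 304 Not Modified and send nothing at all; a 200 carries the full feed, old and new entries alike, never a partial changeset. A sketch of the client-side bookkeeping, kept free of any HTTP library so only the logic shows:

```python
# Conditional-GET bookkeeping for a feed fetcher. conditional_headers()
# builds the request headers from what we cached last time;
# apply_response() interprets the result. No partial feeds exist:
# a 200 body is always the whole feed.

def conditional_headers(cache: dict) -> dict:
    headers = {}
    if cache.get("etag"):
        headers["If-None-Match"] = cache["etag"]
    if cache.get("last_modified"):
        headers["If-Modified-Since"] = cache["last_modified"]
    return headers

def apply_response(cache: dict, status: int, headers: dict, body: str):
    """Update the cache; return the feed body, or None on 304."""
    if status == 304:
        return None  # nothing changed; reuse previously parsed entries
    if "ETag" in headers:
        cache["etag"] = headers["ETag"]
    if "Last-Modified" in headers:
        cache["last_modified"] = headers["Last-Modified"]
    return body  # the FULL feed; dedup against stored entries yourself
```

And as the parent notes, a static blog will serve (or 304) the same file either way; deduplication always happens on the client.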