Crowd Sourced Archival system - An unorthodox yet sound method for crowd sourcing archival infrastructure under one coherent system


soyjersh

kiwifarms.net
Joined Jun 12, 2023
This post is a call to action as well as a technical proposal for review. I claim no expertise in implementing a solution like the one proposed here, only a modest familiarity with the technologies involved and an idea of how to deploy them. The reason for such a deployment is simple: dissatisfaction with the current landscape of archival resources for Kiwifarms material, specifically those for archiving websites. Only one is suggested on this thread, archive.is, and while it serves the necessary purpose, it stands alone and is out of our control. As we’ve seen with the censorship performed by the Internet Archive and the data breach and DDoS of the Wayback Machine, other services are unreliable, and a single reliable choice is a system with a target on its back, probably one drawn there by someone who commits “consent accidents” with impunity. My desire is therefore to use the untapped social capital of Kiwifarms to host a truly antifragile archive system for website snapshots, hosted by trusted individuals and governed, preferably, by our enlightened despot Null or an adequate stand-in.

The system is explained here in nontechnical language for the sake of general understanding. It would involve OpenZiti, tooling for building a zero trust overlay network: an “invisible Internet” where only approved, cryptographically authenticated peers can communicate, and only with the services they’re authorized to see. Participants would be connected to peers through programmed policies in an anonymous, encrypted way, without complex routing, firewall changes, or public DNS configuration. Services can be published securely from anywhere. It decouples identity and access from the underlying topology, making it easy to securely connect microservices, containers, or IoT nodes across disparate environments. Anyone meeting the criteria set by the network operator could join and provide their system’s resources in a highly secure, use-specific way. In my opinion the best use for this is hosting a bunch of InterPlanetary File System (IPFS) nodes, with select nodes acting as IPFS gateways. IPFS is a peer-to-peer, content-addressable network for storing and sharing data in a distributed fashion. It does not copy data around on its own: each node keeps what it pins, so with many participants pinning the same archives, files could exist in dozens or hundreds of locations simultaneously, giving resilience against single points of failure, reduced downtime, and global redundancy without relying on one provider.

In essence, network participants would each run an IPFS node over the OpenZiti network, linking peers invisibly into a highly resilient storage pool. What would be the best use of such a pool? In my opinion, hosting ZIM file snapshots of websites. ZIM files already achieve high compression (often reducing website size by 60–90%). IPFS adds deduplication: if multiple ZIM backups share identical chunks (say, common assets or articles), those chunks are stored only once rather than once per archive.
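To make the deduplication idea concrete, here is a minimal sketch of a content-addressed block store. It is not how IPFS is implemented: the `BlockStore` class, the tiny `CHUNK_SIZE`, and the use of plain SHA-256 hex digests as chunk keys are all assumptions made for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny on purpose so the example is easy to follow


def chunk(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class BlockStore:
    """Toy content-addressed store: each unique chunk is kept once."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def add(self, data: bytes) -> list[str]:
        """Store data and return the ordered list of chunk digests."""
        manifest = []
        for piece in chunk(data):
            digest = hashlib.sha256(piece).hexdigest()
            # setdefault keeps the existing copy if this chunk was seen before
            self.blocks.setdefault(digest, piece)
            manifest.append(digest)
        return manifest

    def get(self, manifest: list[str]) -> bytes:
        """Rebuild the original bytes from a list of chunk digests."""
        return b"".join(self.blocks[d] for d in manifest)


store = BlockStore()
v1 = store.add(b"AAAABBBBCCCC")  # first snapshot
v2 = store.add(b"AAAABBBBDDDD")  # second snapshot, only the last chunk differs
print(len(store.blocks))         # unique chunks held for both snapshots
```

Two snapshots of three chunks each end up as four stored chunks, because the first two chunks are shared. With fixed-size chunking this only helps when the shared bytes line up on chunk boundaries, so how much two real ZIM snapshots would actually share is something that would need measuring.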

The combination results in massive bandwidth savings for distribution, storage optimization across multiple archives or mirrors, and incremental versioning: new versions of a site only store changed blocks, not the entire archive. This makes IPFS + ZIM an ideal setup for versioned website archiving over time. Traditional backups often involve database dumps (SQL files), separate asset directories, and configuration files. Restoring them requires a specific environment and version compatibility. A ZIM file is a “frozen” snapshot encapsulating everything necessary to render the site. By hosting a ZIM on IPFS, complexity is reduced to one file and one CID. Restoration becomes as easy as fetching the file and viewing it locally or serving it from an IPFS gateway. It’s also possible to integrate IPFS CIDs into a website’s metadata or Git commits for transparent public archiving.

The system described so far is basically designed so that whoever runs specific OpenZiti overlay network components can decide who runs a node and contributes to it, in an anonymous fashion where no one really needs to know who’s who and would have a hard time finding out. There are concerns, of course, about a malicious node, possibly one hosting content which would ideally be inaccessible.

This is where the fatefully named Kiwix and IPFS Gateway nodes come into play. Any IPFS resource is identified by a content hash (CID). This means the CID is the content. If even one byte changes, the CID changes. This provides a tamper-proof fingerprint of any given dataset. For moderation, this is powerful because at Gateway nodes you can blacklist specific CIDs that represent illegal, malicious, or inappropriate archives. You can publish whitelists of verified, reputable ZIM archives. Communities can build reputation registries mapping CIDs to trust scores or moderation categories. So, instead of deleting data, IPFS moderation often means maintaining cryptographically verified blacklists and trust registries that nodes, gateways, or frontends can voluntarily follow.
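A rough sketch of how a gateway could apply such lists is below. It is a simplification under stated assumptions: a bare SHA-256 hex digest stands in for the CID (real CIDs wrap a multihash with codec and version prefixes, and large files are addressed by the root of a chunk graph), and `content_id`, `gateway_verdict`, and `Verdict` are hypothetical names, not part of any IPFS gateway.

```python
import hashlib
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    UNKNOWN = "unknown"


def content_id(data: bytes) -> str:
    """Stand-in for a CID: the SHA-256 hex digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


def gateway_verdict(
    cid: str,
    allowlist: set[str],
    denylist: set[str],
    default_deny: bool = True,
) -> Verdict:
    """Decide whether a gateway should serve the content behind `cid`."""
    if cid in denylist:        # the denylist always wins
        return Verdict.BLOCK
    if cid in allowlist:
        return Verdict.ALLOW
    # Content nobody has vouched for: block it, or leave it for review.
    return Verdict.BLOCK if default_deny else Verdict.UNKNOWN
```

The `default_deny` switch is the design choice that matters: with it on, the gateway only serves what maintainers have explicitly listed, so a malicious node's content stays unreachable from the open internet even before anyone has blacklisted it.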

The frontend in this case would be a Kiwix server. Kiwix comes as a server and a client: the server for whoever hosts the various backup ZIMs, and the client for end users to connect with so that they can access said ZIMs. When Kiwix imports a ZIM file, it also reads and stores metadata. This metadata can be used to detect and flag anomalies (e.g., fake Wikipedia mirrors or files with no provenance), to categorize content based on educational, cultural, or sensitive material tags, and finally to filter content visibility, only loading ZIMs signed by known public keys. Kiwix can therefore implement a policy-based moderation layer, where the visibility of content depends on cryptographic signatures from trusted curators, metadata validation, or local administrator preferences. This transforms moderation from censorship into contextual access control.
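As an illustration of what a metadata-based visibility policy might look like, here is a minimal sketch. It is not an existing frontend feature: the `Policy` class and `check_zim_metadata` function are hypothetical, the metadata is assumed to be already extracted into a plain dict, and the field names are assumed to follow ZIM metadata keys such as `Title`, `Publisher`, `Source`, and `Date`. Signature verification is left out because the Python standard library has no public-key signature support.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Policy:
    allowed_hosts: frozenset[str]        # sites the archive may come from
    trusted_publishers: frozenset[str]   # curators whose ZIMs are shown
    required_fields: tuple[str, ...] = ("Title", "Publisher", "Source", "Date")


def check_zim_metadata(meta: dict[str, str], policy: Policy) -> list[str]:
    """Return a list of policy violations; an empty list means 'show it'."""
    problems = [f"missing {key}" for key in policy.required_fields if not meta.get(key)]

    # URL-level filter: only archives of approved sites are visible.
    host = (urlparse(meta.get("Source", "")).hostname or "").lower()
    if host not in policy.allowed_hosts:
        problems.append(f"source host not allowed: {host or 'none'}")

    if meta.get("Publisher") not in policy.trusted_publishers:
        problems.append("publisher not trusted")
    return problems


policy = Policy(
    allowed_hosts=frozenset({"example.org"}),
    trusted_publishers=frozenset({"archive-team-a"}),
)
meta = {
    "Title": "Example thread",
    "Publisher": "archive-team-a",
    "Source": "https://example.org/thread/1",
    "Date": "2025-01-01",
}
print(check_zim_metadata(meta, policy))  # an empty list means the ZIM is shown
```

Returning the list of violations instead of a bare yes/no lets an administrator see why a ZIM was hidden, which fits the idea of contextual access control rather than silent removal.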

Any technically minded reader will notice the many details lacking in this explanation, but hopefully also sees the potential such a system has, particularly in combination with what I firmly believe is the vast untapped potential of website users united under a common cause: archiving everything, even against pressure from ontologically evil people, to borrow a term used by someone who is one. I have further details about my potential implementation of this system which I think are better shared with the people who would implement it for the goals it should fulfill, but I’m open to questions and comments since, again, I am no expert on how a system like this could be used or implemented, and only have some of the technical knowledge needed to deploy it.
 
im not reading all that shit
For non-technical farms members, understanding is not required. The eventual intent is for you to voluntarily contribute compute and bandwidth, providing a small fragment of a much larger network of peers whose explicit purpose is storing fragments of the epic deets that only the farms has been a home for thus far. I firmly believe that fellow farmers will be willing to aid in the creation of this Akashic Record of autism.
 
I don't know much about the archive.today administration. Is it one guy? If it actually does go down we're so fucked.
Megalodon.jp is pretty good, but not the gold standard of Archive.today. Ghostarchive is wildly inconsistent but can handle scripting issues sometimes, and whole PDFs if you're lucky. Then you've got IA Wayback Machine, where the right email or phone call could get an archived page taken down, but even so, it remains very useful.

Actually getting the content isn't that much of a problem, you can screenshot, download videos, and download MHTML pages. The archive services are useful for being impartial third-parties that aren't (likely to be) manipulated, so nobody can be accused of "inspecting element" or shooping the pixels. They also allow easy coordination. If something's already been saved, I can move onto the next thing or update it.

Maybe an IPFS-based scheme can work. I'm sure a lot of us would be willing to "donate" 100 GB or terabytes for such a scheme. I'm worried about the malicious nodes inserting viruses, fake shit, etc. Maybe we just want some number of "trusted" nodes in the pile, to provide this alternative to the big 4.
 
decentralized anything has a stumbling block in requiring a certain amount of people to really get going, i.e. a torrent with 2 seeds
 
This is true. While the system is designed so that it can safely rely on the kindness of strangers, existing infrastructure like Filecoin can be used in the short term as a launching point, and even in the long term as a payment structure giving money to kiwis for operating nodes.
 
How would you stop n-chan raids from loading it up with CSAM?
Since it is IPFS, it is likely difficult if not impossible to stop CSAM from existing on individual nodes if they are given access to the network, which, though unlikely, is admittedly possible. It has to be emphasized that because of OpenZiti, anyone who wants to contribute needs special permission files on their system for it to be possible, let alone permitted. However, it would be trivially easy to limit the CSAM to that one node, using whitelists of CIDs like I mentioned. Additionally, even if multiple malicious nodes somehow remained on the network, none of it would be accessible to the open internet unless a specially permissioned gateway node was blind to the content on it (things like perceptual hashing could be used to ensure a meticulous absence of abuse images). Even bypassing that, it would still need to beat Kiwix's content filtration, which could work at the URL level to ensure all archived ZIMs come from certain sites only.
 
Accessibility is key, not just for the end users but also for people who scrape.
Internetarchive's warrior container is self-sufficient and is basically set and forget.

I have members only resetera threads (mostly erectile dysfunction and suicide baiting) that could use automated scraping.
 
There exists this tool which has been containerized and works quite well for what you're describing, but it currently isn't accessible to non-technical users. I plan to make an interface for it designed for ease of use, with something like Portainer.
 
Interesting, I do have a hoard of ZIMs from before the Wikipedia collapse. Honestly, if you have the right envs you can make an easy compose file for easy and widespread deployment.
 
This means the CID is the content. If even one byte changes, the CID changes. This provides a tamper-proof fingerprint of any given dataset. For moderation, this is powerful because at Gateway nodes you can blacklist specific CIDs that represent illegal, malicious, or inappropriate archives. You can publish whitelists of verified, reputable ZIM archives.
A page that's entirely identical to how it was yesterday can have a different CID because a JS file from some CDN that's linked on the page got bumped and is now served with a different timestamp. Webshit in general likes constantly fiddling with things for no good reason. With decentralized archiving, it would probably be better to use perceptual hashing. You see a page that was archived by me, grab the hash, set the scraper's headless browser output settings to match mine and then verify authenticity of the fully rendered page.
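For reference, the simplest form of perceptual hashing (an "average hash") can be sketched as below. This is an illustration under assumptions: the rendered page is taken to be already converted to a small grayscale grid (typically 8x8) by some screenshot and downscale step that is not shown, and `average_hash` and `hamming` are illustrative names rather than any particular library's API.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Build a bit string: 1 where a pixel is brighter than the grid's mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes (0 means identical)."""
    return bin(a ^ b).count("1")


# 2x2 grids keep the example readable; a real hash would use a larger grid.
mine = [[10, 200], [220, 30]]
yours = [[12, 198], [221, 29]]      # same page, slightly different rendering
different = [[200, 10], [30, 220]]  # a visibly different page

print(hamming(average_hash(mine), average_hash(yours)))      # small distance
print(hamming(average_hash(mine), average_hash(different)))  # large distance
```

Verification then becomes a threshold check on the Hamming distance instead of an exact match, which is what lets two honest renders of the same page agree despite small rendering noise.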
 
That's another issue I didn't elaborate on explicitly. The whole concept of CID whitelists is more nuanced than I initially described. A semantic hash extracts the pages and normalizes each one (strip build timestamps, canonicalize whitespace, remove cache-busters in CDN URLs or rewrite them to a canonical hostname, sort resource lists), hashes per page, and then aggregates the page hashes for the whole-site ZIM (e.g., a Merkle root of the sorted page digests). This is then recorded in InterPlanetary Linked Data (IPLD) as a DAG-CBOR (Directed Acyclic Graph, Concise Binary Object Representation) object with attestation fields such as the CID, the diff summary, the logical identity of the website backup, the semantic hash described above, and a reason for updating the previous trusted version, alongside signatures from network maintainers meeting whatever signer threshold is set.
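A minimal sketch of the semantic hash follows, under heavy assumptions. Normalization here only collapses whitespace and drops the query string and fragment from absolute URLs, which is a crude stand-in for cache-buster removal and would also discard meaningful query parameters. The function names are illustrative, and the IPLD record and threshold signatures are not shown.

```python
import hashlib
import re
from urllib.parse import urlsplit, urlunsplit

URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def strip_query(match: re.Match) -> str:
    """Drop the query string and fragment, where cache-busters usually live."""
    parts = urlsplit(match.group(0))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def normalize(html: str) -> str:
    """Canonicalize a page so that cosmetic churn does not change its digest."""
    html = URL_RE.sub(strip_query, html)
    return " ".join(html.split())  # collapse every whitespace run to one space


def page_digest(html: str) -> bytes:
    return hashlib.sha256(normalize(html).encode("utf-8")).digest()


def merkle_root(digests: list[bytes]) -> bytes:
    """Merkle root over the sorted digests, duplicating the last node of odd levels."""
    if not digests:
        raise ValueError("need at least one page digest")
    level = sorted(digests)  # sorting makes the root independent of page order
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def semantic_hash(pages: list[str]) -> str:
    """Hex digest summarizing a whole site snapshot."""
    return merkle_root([page_digest(p) for p in pages]).hex()


a = '<script src="https://cdn.example/app.js?v=123"></script>\n<p>hello   world</p>'
b = '<script src="https://cdn.example/app.js?v=456"></script> <p>hello world</p>'
print(semantic_hash([a]) == semantic_hash([b]))  # True: only cosmetic differences
```

A snapshot whose semantic hash matches the previously trusted one can be treated as the same content even though its CID differs, while a real change still produces a new hash that maintainers would have to sign off on.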
 
With this happening, I feel it's appropriate to at least ping @Null to see his opinion on this proposal. The fact is that both major political parties in the US would love to see someplace like archive.is gone, which makes alternatives necessary. It's a matter of when, not if, such a thing will be needed.
 