Using wget directly to scrape IMSLP for PDFs is technically possible but highly problematic and likely violates their terms of service. Here's the technical reality:
IMSLP Architecture:
- Built on MediaWiki with URLs like /wiki/Title
- PDF links point through download.php?file=... or direct paths under /files/
- PDFs are stored in structured directories but not publicly indexable as a simple directory listing
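To make the two link shapes above concrete, here is a rough shell sketch that pulls candidate links out of page HTML that has already been saved. The function name `extract_score_links` is hypothetical, and the patterns only mirror the `download.php?file=...` and `/files/` forms described above; the real markup may differ, so treat this as a starting point and not as IMSLP's actual link scheme.

```shell
# Hypothetical helper: read HTML on stdin and print the href values that
# look like the two link shapes described above (patterns are assumptions).
extract_score_links() {
  grep -oE 'href="[^"]*"' |
    sed -E 's/^href="//; s/"$//' |
    grep -E '(download\.php\?file=|/files/)' |
    sort -u
}

# Made-up HTML fragment for illustration only; these are not real IMSLP paths.
sample='<a href="/wiki/Some_Work">work</a>
<a href="/download.php?file=12345">score</a>
<a href="/files/example.pdf">pdf</a>'
printf '%s\n' "$sample" | extract_score_links
```

The helper touches no network, so it can be tried on a single page saved by hand before any crawling is attempted.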
Why wget alone won't work well:
- robots.txt restrictions: IMSLP's robots.txt limits automated access, and well-behaved crawlers are expected to honor its Disallow rules and any crawl delay it sets (typically 5+ seconds)
- Dynamic content: The PDF URLs aren't static - they're generated through the download script based on file IDs
- Rate limiting: Aggressive scraping will get your IP blocked quickly
- Navigation complexity: You'd need to parse every composer page → work page → score page → extract the PDF link, which requires full HTML parsing
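The delay a site asks for can be read from its robots.txt instead of being guessed. The following is a minimal sketch, assuming robots.txt has already been fetched and is passed on stdin. `crawl_delay` is a hypothetical helper name, it ignores which User-agent group the directive belongs to, and its fallback of 5 seconds comes from the figure quoted above, not from IMSLP's actual file.

```shell
# Hypothetical helper: print the first Crawl-delay value found on stdin,
# or a default (first argument, 5 if omitted) when none is present.
# Non-integer values such as "0.5" also fall back to the default.
crawl_delay() (
  default="${1:-5}"
  value="$(tr -d '\r' |
    grep -iE '^[[:space:]]*crawl-delay[[:space:]]*:' |
    head -n 1 |
    sed -E 's/^[^:]*:[[:space:]]*//; s/[[:space:]]*(#.*)?$//')"
  case "$value" in
    ''|*[!0-9]*) echo "$default" ;;
    *) echo "$value" ;;
  esac
)

# Made-up robots.txt content for illustration only.
printf 'User-agent: *\nCrawl-delay: 10\n' | crawl_delay
```

In practice the input would come from something like `curl -s https://imslp.org/robots.txt | crawl_delay`, and the result would feed the `--wait` value of any later crawl.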
Better technical approaches:
Option 1: MediaWiki API (Recommended)
```bash
# List a category's pages via the MediaWiki Action API, then the files each page uses.
# Assumptions: the API is served at /api.php and the category name is a placeholder.
API="https://imslp.org/api.php"
curl -sG "$API" -d action=query -d list=categorymembers -d format=json \
     --data-urlencode "cmtitle=Category:Public_domain_works" |
  jq -r '.query.categorymembers[].title' |
  while IFS= read -r title; do
    curl -sG "$API" -d action=query -d prop=images -d format=json \
         --data-urlencode "titles=${title}" | jq -r '.query.pages[].images[]?.title'
    sleep 5  # pause between requests
  done
```
Option 2: Limited wget with proper constraints
```bash
# Throttled, shallow crawl that keeps only PDFs. The User-Agent contact is a placeholder.
wget --recursive --level=2 \
     --wait=10 --random-wait \
     --user-agent="score-fetcher/0.1 (contact: you@example.com)" \
     --accept=pdf \
     https://imslp.org/wiki/Category:Public_domain_works
```
Critical warnings:
- IMSLP's usage policy explicitly discourages aggressive scraping
- IMSLP hosts works that are in the public domain in Canada; whether you may legally download a given file depends on the copyright law where you are
- Your IP will likely be banned after moderate volume
- The site may implement CAPTCHAs or other anti-bot measures
Legitimate alternative:
Use the IMSLP MediaWiki API to retrieve metadata programmatically and construct proper download URLs. The API is the intended route for programmatic access, but requests still need to be throttled.
If you proceed anyway:
- Implement exponential backoff on 429/503 responses
- Add random delays between requests (5-15 seconds)
- Use a legitimate User-Agent string
- Respect any error messages about rate limiting
- Consider that this could result in permanent IP bans
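The backoff and delay points above can be sketched in shell as follows. This is an illustration under stated assumptions, not a vetted client: `backoff_seconds`, `random_pause_seconds`, `polite_fetch`, and the `USER_AGENT` value are all hypothetical names, and the base delay of 5 seconds and cap of 300 seconds are arbitrary choices.

```shell
# Placeholder identity; replace with a real contact address.
USER_AGENT="score-fetcher/0.1 (contact: you@example.com)"

# Seconds to wait before retry number $1 (0-based): base * 2^attempt, capped.
# The parenthesised body runs in a subshell, so its variables stay local.
backoff_seconds() (
  attempt="$1"; base="${2:-5}"; cap="${3:-300}"
  delay="$base"; i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$cap" ]; then delay="$cap"; fi
  echo "$delay"
)

# Random whole number of seconds in [min, max]. awk seeds from the clock,
# so two calls within the same second return the same value.
random_pause_seconds() {
  awk -v min="$1" -v max="$2" \
    'BEGIN { srand(); print min + int(rand() * (max - min + 1)) }'
}

# Fetch URL $1 into file $2, retrying only on HTTP 429/503.
polite_fetch() (
  url="$1"; out="$2"; max_tries="${3:-5}"; try=0
  while [ "$try" -lt "$max_tries" ]; do
    code="$(curl -sS -L -o "$out" -w '%{http_code}' -A "$USER_AGENT" "$url")" || code=000
    case "$code" in
      200) exit 0 ;;
      429|503) sleep "$(backoff_seconds "$try")" ;;
      *) echo "giving up on $url (HTTP $code)" >&2; exit 1 ;;
    esac
    try=$((try + 1))
  done
  exit 1
)

# Usage (not run here): fetch one file, then pause 5-15 seconds.
#   polite_fetch "$url" score.pdf && sleep "$(random_pause_seconds 5 15)"
```

Stopping on any status other than 200, 429, or 503 is deliberate: a 403 or a CAPTCHA page is a signal to stop, not something to retry through.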
The most reliable approach is to check whether an existing tool, such as the jlumbroso/imslp project, already handles the API interactions properly.