How to use this Python script? - Bro where is install.exe


sodomy lifestyle

Raising money for a bail bond
True & Honest Fan
kiwifarms.net
Joined
Aug 16, 2024
Hi, I'm trying to collect Cyraxx soundbites, and I figured the easiest way to comb through literal days of whining is to download the YouTube transcripts and search them for key phrases. I found this useful-looking tool on GitHub:

youtube-bulk-transcript

What is the easiest way to use it? Do I need to resort to a command prompt and/or batch files, or is there a graphical interface I could use? I'm an absolute novice at coding.
 
Download the Python interpreter here (my guess is you'll want the 64-bit Windows installer) and install it.

Download the GitHub repository and open the folder where it's located. Open the command prompt (cmd.exe) at that location (I think on Windows you can do this with Shift+right-click in File Explorer; not sure).

Now you need to pull the project's dependencies.

Code:
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt

Now you should be able to use the commands from the README to download your transcripts.

EDIT: I tried the script and it's out of date. I managed to fix it up. If you send me the channel/playlist that you want scraped, I'll do it for you and send you the transcripts in a zip. Or you can try yourself with this edited version.
 

If you send me the channel/playlist that you want scraped, I'll do it for you and send you the transcripts in a zip.

If you're willing to do a favor to a computing mong, the channel I'm eyeing is www.youtube.com/@GoblinRecordsOfficial . With over a thousand clips, it'll keep me busy scrubbing probably for the rest of the year. If the script allows attaching the related video URL and title to each transcript, that would be helpful.
 
Are you familiar with using yt-dlp to download YouTube videos? It can also download subtitles.
If you have yt-dlp installed properly, you can run this command in a terminal.
Code:
yt-dlp --skip-download --ignore-config --ignore-errors --write-subs --write-auto-subs https://www.youtube.com/channel/UCgpVO5oxAh7oMk3vynU-2Vg
BTW make sure you've updated yt-dlp recently if you try this.
 
Are you familiar with using yt-dlp to download Youtube videos? It can also download subtitles.

Does it allow downloading all of a channel's subtitles in one go? I'm seeking a solution where I can avoid having to pick individual clips. There are several online services that offer something similar, but the best one I've found is limited to the transcripts of the hundred most recent clips.
 
Here.

P.S. It's a compressed tape archive so you don't feel like it was free.

Edit: I can't attach it apparently: https://files.catbox.moe/yg0vgw.gz
The original extension is ".tar.gz".

Oh dear, this just leads to another issue. Now I would need a way to batch convert the .vtt format to plaintext.

:thinking:

[edit]

Essentially the ideal end result is an uninterrupted wall of text per clip, perfect for searching long phrases.

wordswordswords.png
 
Download Notepad++ and use a regex to remove every line that contains a '>', which gets rid of the timestamps and so on, then replace all line breaks with a space. I'm not sure how that format is structured, but just find a pattern in the lines you want removed, delete those lines, and then replace the line breaks. Now you have a block of text.

Don't know regex? Copilot and other LLMs do! Specify that you want to use it in Notepad++. You can open all the files in Notepad++ and then run it on all open documents. Not the fanciest solution but...
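If Python is already installed, the same cleanup can be sketched in a few lines instead of a Notepad++ regex. This is only a rough sketch, not a tested converter for every file: the function name `vtt_to_text` is made up for illustration, and it assumes the files look like typical YouTube .vtt output (a `WEBVTT` header, `Kind:` and `Language:` lines, cue timing lines containing `-->`, and inline tags in angle brackets).

```python
import re

# Inline cue tags such as <c> or <00:00:01.500> (assumed YouTube-style markup).
TAG = re.compile(r"<[^>]+>")


def vtt_to_text(vtt: str) -> str:
    """Collapse a WebVTT transcript into one uninterrupted line of text."""
    kept = []
    for raw in vtt.splitlines():
        line = TAG.sub("", raw).strip()
        # Skip blank lines, the file header, and cue timing lines.
        if not line or line.startswith("WEBVTT") or "-->" in line:
            continue
        # Skip header metadata and comment blocks (assumption: no caption
        # text starts with these words).
        if line.startswith(("Kind:", "Language:", "NOTE")):
            continue
        # Auto-generated captions often repeat the previous line as they scroll.
        if kept and kept[-1] == line:
            continue
        kept.append(line)
    return " ".join(kept)
```

Calling `vtt_to_text(open("clip.en.vtt", encoding="utf-8").read())` on one file (the file name here is a placeholder) should give a single searchable block of text.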
 
The script is quite dumb and doesn't account for rate limiting, so I'm IP banned atm lol. I might try again tomorrow.
For future reference, use yt-dlp as @MongolianMongoose pointed out and add a wait between each pull to help avoid being rate limited. yt-dlp also makes it easy to pick up where you left off, and combined with a VPN you can get around a lot of their countermeasures.
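For anyone who would rather drive this from a script, here is a minimal sketch of adding that wait. It only assembles the yt-dlp flags already used in this thread plus `--sleep-subtitles`, which pauses before each subtitle download. The function names and the 5-second value in the usage note are placeholders, not recommendations.

```python
import subprocess


def build_subtitle_command(url, sleep_seconds=None):
    """Build a yt-dlp argument list that fetches subtitles only."""
    cmd = [
        "yt-dlp",
        "--skip-download",    # subtitles only, no video
        "--ignore-config",
        "--ignore-errors",    # keep going when one video fails
        "--write-subs",
        "--write-auto-subs",
    ]
    if sleep_seconds is not None:
        # Wait this many seconds before each subtitle download.
        cmd += ["--sleep-subtitles", str(sleep_seconds)]
    cmd.append(url)
    return cmd


def download_subtitles(url, sleep_seconds=None):
    """Run yt-dlp; assumes it is installed and on PATH."""
    return subprocess.run(build_subtitle_command(url, sleep_seconds)).returncode
```

Something like `download_subtitles(channel_url, sleep_seconds=5)` would then run the whole channel with a pause between pulls.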

ffmpeg can do this but he threw it all in one big file which means you'd need to break it back up.

You can use this to download them yourself. Add --sleep-subtitles X, where X is the number of seconds, if you want to sleep between subtitle downloads.

Code:
yt-dlp --write-auto-subs --write-subs --convert-subs "srt" --download-archive archive.txt --skip-download --no-post-overwrites -o "%(uploader)s/%(title)s [%(id)s].%(ext)s" -a video_ids.txt --exec "before_dl:echo REMOVED ID: \"%(id)s\"; sed -i '/%(id)s/d' video_ids.txt"

This won't actually make the archive.txt file but it can't hurt to keep there for reference. The video_ids.txt file is attached with the ID of each video on the channel so it's easier to pause and resume since there are so many videos. It uses that file to get the list of videos and then removes the ID as it goes using sed.
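Since the output template above puts the title and the video ID in the file name, the title and URL asked for earlier can be recovered from the name alone. A rough sketch, assuming the files end up named like `Title [ID].en.srt` and that IDs are the usual 11 characters; `describe` is a made-up helper name, and the title may differ slightly from the one on YouTube if characters were altered when the file was written.

```python
import re
from pathlib import Path

# "Title [VIDEOID]." where VIDEOID is 11 URL-safe characters (assumption).
NAME = re.compile(r"^(?P<title>.*) \[(?P<id>[A-Za-z0-9_-]{11})\]\.")


def describe(path):
    """Return (title, watch URL) parsed from a subtitle file name, or None."""
    match = NAME.match(Path(path).name)
    if match is None:
        return None
    return match["title"], "https://www.youtube.com/watch?v=" + match["id"]
```

The pair it returns can be written as the first line of each transcript so every block of text carries its source.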

You can say "Well I don't want to use the command line" but you're making it harder on yourself and that's on you. No one wrote a program to do this very specific task for you.

Once you have all the subtitles downloaded, you can parse them again using sed or whatever your favorite method is. I think srt is the simplest format to edit since each non-caption line is easy to parse out and then you can remove the newlines which is trivial.
Ask for help as you need it.
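If sed isn't available (it usually isn't on Windows), the parsing step can be sketched in Python instead. This is a sketch under assumptions, not a finished tool: it expects ordinary .srt files (a counter line, a timing line, then caption text), and the names `srt_to_text` and `convert_folder` are made up for illustration.

```python
import re
from pathlib import Path

# SRT timing line, e.g. 00:00:01,000 --> 00:00:03,500
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def srt_to_text(srt: str) -> str:
    """Drop counters and timing lines, then join the captions with spaces."""
    lines = [line.strip() for line in srt.splitlines()]
    kept = []
    for i, line in enumerate(lines):
        if not line or TIMING.match(line):
            continue
        following = lines[i + 1] if i + 1 < len(lines) else ""
        # A bare number is a counter only when a timing line follows it.
        if line.isdigit() and TIMING.match(following):
            continue
        # Skip a caption that merely repeats the previous one.
        if kept and kept[-1] == line:
            continue
        kept.append(line)
    return " ".join(kept)


def convert_folder(folder):
    """Write a .txt file next to every .srt file under folder."""
    written = []
    for src in sorted(Path(folder).rglob("*.srt")):
        dst = src.with_suffix(".txt")
        # utf-8-sig also copes with files that start with a byte order mark.
        text = srt_to_text(src.read_text(encoding="utf-8-sig"))
        dst.write_text(text, encoding="utf-8")
        written.append(dst)
    return written
```

Running `convert_folder` on the folder holding the downloaded subtitles leaves one wall of text per clip, which is the end result asked for above.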
 

I went through and did what I think you're after for all of the transcriptions I could get. Many of them require a signed-in account. If you use the command above and add --cookies-from-browser chrome (for example), you can get some of those from the attached txt file. If you get those and want them parsed out into blocks of text, share them with me and I'll do the same thing to them that I did to the others.

There are some more that are also not in the 7zip file and that I would need to re-verify, but I don't feel like it right now.
 
