31 Commits

Author SHA1 Message Date
Alex 387237d017 Update README.md 2025-01-15 02:40:51 -08:00
muskit cb921ee911 ignore privated tweets entirely (prob the reason why my scrapers get kicked) 2024-04-29 23:55:29 -07:00
muskit 85b4bfe939 add condition for unavailable tweet 2024-04-29 17:20:23 -07:00
muskit 661c9232a3 switch to latest tweety no matter what ¯\_(ツ)_/¯ 2024-04-29 17:20:08 -07:00
muskit 8ba0394e9c add no-cache build toggle line 2024-04-29 17:13:39 -07:00
muskit 6b585ad96a print ttweet object when posting 2024-04-26 00:16:16 -07:00
muskit eadf130305 fix logic error of handling finished tweets 2024-04-26 00:15:45 -07:00
muskit 24877eb53e add cross-agency case 2024-04-26 00:15:32 -07:00
muskit 8e163e26cf extend wait to try and avoid revocation (didnt work) 2024-04-26 00:14:46 -07:00
muskit f035896226 add script to access shell in container 2024-04-26 00:13:35 -07:00
muskit 034c71abbb use unbuffered stdout to fix docker logging 2024-04-26 00:13:24 -07:00
muskit bfc9066617 remove backup tweet format that excludes screenshot 2024-04-21 16:05:37 -07:00
muskit 0203578987 add preempt tweet list handling 2024-04-21 16:04:33 -07:00
muskit 99155fdb37 add .venv to avoid long image rebuilds 2024-04-21 16:03:32 -07:00
muskit d05da5bff0 add screenshots parent tweet limit 2024-04-20 12:28:39 -07:00
muskit 61a19e3fe2 revise scripts 2024-04-20 12:28:22 -07:00
muskit d5d8db272f update run.sh 2024-04-20 01:41:54 -07:00
muskit 3d96e8532a update requirements_dev.txt 2024-04-20 01:41:42 -07:00
muskit 7fc86543e6 temporary patch for scraper lock outs 2024-04-20 01:41:27 -07:00
muskit bacc426a6d go back to passwords, use diff format 2024-04-19 23:30:28 -07:00
muskit 18dfb0a7c9 update tweety 2024-04-19 23:25:16 -07:00
muskit 1ca9fce722 move scripts into its own folder 2024-04-19 23:05:13 -07:00
muskit f1eace1f63 update scraper.py 2024-03-18 00:53:40 -07:00
muskit 608e712bce update scraper.py 2024-03-18 00:43:42 -07:00
muskit 805a1355fa update scraper.py 2024-03-18 00:42:47 -07:00
muskit 8f0825ec2e distinguish RateLimitReached from UnknownError 2024-03-18 00:21:51 -07:00
muskit f9a9e47f7d update scraper.py 2024-03-18 00:05:46 -07:00
muskit 3eded34c0f update scraper.py 2024-03-18 00:02:21 -07:00
muskit af7ce7150b docs: update README.md 2024-03-17 23:52:00 -07:00
muskit c88ddc749a try using auth tokens on scraper 2024-03-17 23:50:17 -07:00
muskit 95f654316b update tweety commit 2024-03-17 22:07:13 -07:00
20 changed files with 179 additions and 104 deletions
+1
@@ -1 +1,2 @@
./run
./.venv
+3 -2
@@ -17,8 +17,9 @@ RUN pip3 install --break-system-packages -r requirements.txt
# Copy source code
COPY . .
# Mount working directory
# Mount persistent working directory
VOLUME ./run
# Run the bot
CMD ["python3", "src/main.py"]
CMD ["python3", "-u", "src/main.py"]
#CMD ["python3", "-u", "src/main.py", "--straight-to-queue"]
+24 -20
@@ -5,6 +5,8 @@ Twitter bot that tracks cross-company interactions between the non-JP branches o
**This project was created to run [this account](https://twitter.com/NijiHolo_EN_ID).**
![Screenshot_20230912_235245](https://github.com/muskit/muskit/assets/15199219/0359fb26-8a48-4698-9b78-66d7d852099e)
## Running
With the way packages are set up, **you must have Docker installed and running!!**
@@ -12,11 +14,17 @@ Setup the `.env` in the project root. Refer to the [`.env`](#env) section for va
Build and run the Docker container:
```bash
# to run attached (can CTRL+P,CTRL+Q to detach)
sh run.sh
# to delete container and built image
sh scripts/delete.sh
# ... or to run headless
sh run_detached.sh
# to build image
sh scripts/build.sh
# to create container and run attached (can CTRL+P,CTRL+Q to detach)
sh scripts/run.sh
# ... or to run headless/detached
sh scripts/run_detached.sh
```
If attached to a container prepared by Dockerfile, you can run the program from project root (not in `src`). Refer to the following section for options.
@@ -36,17 +44,17 @@ These need to be defined in a `.env` file in the `run` ephemeral directory.
### Scraper Credentials
To get around rate limitations imposed on users, we scrape with multiple accounts. Each account is defined in the file using the following format:
```
scraper_usernameX=twitter_username
scraper_passwordX=twitter_password
scraperX_username=twitter_username
scraperX_password=twitter_auth_token
```
where `X` is a number starting from 0, increasing by 1 for each account added. For instance:
```
scraper_username0=
scraper_password0=
scraper_username1=
scraper_password1=
scraper0_username=
scraper0_password=
scraper1_username=
scraper1_password=
```
The first account (`scraper_username0` and `scraper_password0`) **MUST be defined (`scraper_username` and `scraper_password` without number will not work!)** and will be used to attempt scraping private accounts. Make sure this account follows any private accounts that you want to scrape!
The first account (`scraper0_username` and `scraper0_password`) **MUST be defined (`scraper_username` and `scraper_password` without a number will not work!)** and will be used to attempt scraping private accounts. Make sure this account follows any private accounts that you want to scrape!
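As a rough illustration of how the numbered keys can be read (a sketch under assumptions, not necessarily the bot's exact loader): the hypothetical `load_scraper_accounts` helper below takes an already-parsed mapping of `.env` keys and collects `(username, password)` pairs from index 0 upward, stopping at the first missing index. All account values shown are made up.

```python
def load_scraper_accounts(creds: dict) -> list:
    """Collect (username, password) pairs for scraper0, scraper1, ... until a gap."""
    accounts = []
    i = 0
    # Keep going while both keys for the current index are present.
    while f"scraper{i}_username" in creds and f"scraper{i}_password" in creds:
        accounts.append((creds[f"scraper{i}_username"], creds[f"scraper{i}_password"]))
        i += 1
    return accounts


# Hypothetical values for illustration only.
example = {
    "scraper0_username": "alice",
    "scraper0_password": "secret0",
    "scraper1_username": "bob",
    "scraper1_password": "secret1",
    "scraper3_username": "carol",  # not reached: index 2 is missing, so the scan stops
    "scraper3_password": "secret3",
}
accounts = load_scraper_accounts(example)
print(len(accounts))
```

Because the scan stops at the first gap, indices need to be contiguous, and an unnumbered `scraper_username` key is never matched.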
### Twitter API Stuff
The following keys/tokens are used for the official API via `tweepy`. We mainly use these to just post tweets.
```
@@ -56,20 +64,16 @@ user_token=
user_secret=
```
### Screenshot Cookie *(optional)*
This is the authentication token obtained from a browser when signed in on the Twitter website. It's only needed if you want to screenshot tweets from privated accounts. Make sure the token belongs to an account that follows desired private accounts! Maybe have it belong to `scraper_username0`?
This is the authentication token obtained from a browser when signed in on the Twitter website. It's only needed if you want to screenshot tweets from private accounts. Make sure the token belongs to an account that follows the private accounts you want to capture! Maybe have it belong to `scraper0`?
```
web_auth_token=
```
### Example `.env` without values
```
scraper_username0=
scraper_password0=
scraper_username1=
scraper_password1=
scraper_username2=
scraper_password2=
scraper_username3=
scraper_password3=
scraper0_username=
scraper0_password=
scraper1_username=
scraper1_password=
web_auth_token=
app_key=
app_secret=
-3
@@ -1,3 +0,0 @@
#!/bin/sh
docker build -t nijiholo_bot .
+1 -1
@@ -1,6 +1,6 @@
python-dotenv
nest-asyncio
pytz
git+https://github.com/mahrtayyab/tweety.git@e3d330280cb3b2e8f9d2bf2f20425c476f7671a5
git+https://github.com/mahrtayyab/tweety.git
tweepy
tweet-capture
+1 -1
@@ -1,7 +1,7 @@
python-dotenv
nest-asyncio
pytz
git+https://github.com/mahrtayyab/tweety.git@e3d330280cb3b2e8f9d2bf2f20425c476f7671a5
git+https://github.com/mahrtayyab/tweety.git
tweepy
tweet-capture
opencv-python-headless
-4
@@ -1,4 +0,0 @@
#!/bin/sh
mkdir -p run
docker run -v ./run:/app/run --name bot -it nijiholo_bot
-4
@@ -1,4 +0,0 @@
#!/bin/sh
mkdir -p run
docker run -v ./run:/app/run --name bot -d nijiholo_bot
+6
@@ -0,0 +1,6 @@
#!/bin/sh
CURPATH="$(dirname "$(realpath "$0")")/.."
#sudo docker build -t nijiholo_bot --no-cache "$CURPATH"
sudo docker build -t nijiholo_bot "$CURPATH"
sudo docker container create -v "$CURPATH/run:/app/run" --name bot nijiholo_bot
+4
@@ -0,0 +1,4 @@
#!/bin/sh
sudo docker container rm bot
sudo docker image rm nijiholo_bot
Executable
+6
@@ -0,0 +1,6 @@
#!/bin/sh
CURPATH="$(dirname "$(realpath "$0")")/.."
cd "$CURPATH"
mkdir -p run
#sudo docker run -v "$CURPATH/run:/app/run" --name bot -it nijiholo_bot
sudo docker container start -a -i bot
+6
@@ -0,0 +1,6 @@
#!/bin/sh
CURPATH="$(dirname "$(realpath "$0")")/.."
cd "$CURPATH"
mkdir -p run
#sudo docker run -v "$CURPATH/run:/app/run" --name bot -d nijiholo_bot
sudo docker container start bot
+2
@@ -0,0 +1,2 @@
#!/bin/sh
sudo docker exec -it bot sh
+3 -2
@@ -11,13 +11,14 @@ class AccountPool:
creds = dotenv_values(working_path(file=".env"))
i = 0
while True:
if f"scraper_username{i}" in creds and f"scraper_password{i}" in creds:
if f"scraper{i}_username" in creds and f"scraper{i}_password" in creds:
self.__accounts.append(
(creds[f"scraper_username{i}"], creds[f"scraper_password{i}"])
(creds[f"scraper{i}_username"], creds[f"scraper{i}_password"])
)
i += 1
else:
break
print(f"{len(self.__accounts)} scraper credentials found!")
def use_index(self, idx):
self.__idx = idx
+35 -7
@@ -16,8 +16,9 @@ import ttweetqueue as ttq
PROGRAM_ARGS = None
preempt_done = False
safe_to_post_tweets = True
scraper: Scraper
scraper = Scraper()
# Updates TTweetQueue
@@ -87,7 +88,7 @@ async def process_queue() -> bool:
queued_ttweets_count = queue.get_count()
WAIT_TIME = 60 * 15
WAIT_TIME = 60 * 30 # 30 minutes
ttweets_posted = 0
if queued_ttweets_count == 0:
@@ -110,7 +111,7 @@ async def process_queue() -> bool:
ttweets_posted += 1
print(f"({ttweets_posted}/{queued_ttweets_count}) done")
if not queue.is_empty():
print(f"resting for {WAIT_TIME}s...")
print(f"resting for {WAIT_TIME // 60} minutes...")
await asyncio.sleep(WAIT_TIME - 5)
print("5 second warning!")
await asyncio.sleep(5)
@@ -127,13 +128,13 @@ async def process_queue() -> bool:
# return False = issue occurred where we couldn't post all past tweets properly
async def run(PROGRAM_ARGS):
global safe_to_post_tweets
global preempt_done
global scraper
global queue
scraper = Scraper()
queue = ttq.TalentTweetQueue.instance
# post tweets given in command line first
# OPTION: post tweets given in command line first
if PROGRAM_ARGS.post_id is not None and len(PROGRAM_ARGS.post_id) > 0:
PROGRAM_ARGS.post_id.sort()
print("Posting specified tweets first.")
@@ -150,11 +151,38 @@ async def run(PROGRAM_ARGS):
print("Successfully posted tweet. Sleeping for 5 minutes")
await asyncio.sleep(60 * 5)
else:
print("Did not post tweet")
print("Did not post tweet\n")
print("Done processing specified tweets")
PROGRAM_ARGS.post_id = None
# refresh stored queue first
# PREEMPT: post tweet IDs in preempt.txt if exists and not empty
if not preempt_done:
try:
with open(working_path(file="preempt.txt"), "r") as preempt_file:
print("Found preempt.txt! Posting stored IDs unconditionally...")
for l in preempt_file:
if not l.strip(): continue  # lines keep their trailing newline, so test the stripped text
try:
id = int(l.strip().split()[0])
except (ValueError, IndexError):
print(f"Error occurred processing {l}, skipping...")
continue
posted = await TwAPI.instance.post_ttweet_by_id(id, PROGRAM_ARGS.dry_run)
if posted:
queue.add_finished_tweet(id)
print("Successfully posted tweet. Sleeping for 5 minutes")
await asyncio.sleep(60 * 5)
else:
print("Could not post tweet\n")
print("Finished processing preempt.txt")
preempt_done = True
except FileNotFoundError:
print("preempt.txt wasn't found")
# OPTION: refresh stored queue first
if PROGRAM_ARGS.refresh_queue:
PROGRAM_ARGS.refresh_queue = False
print("Refreshing queue tweets...")
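Review note: the preempt handling above reads one tweet ID per line and keeps only the first whitespace-separated token, so anything after the ID on a line is ignored. A minimal sketch of that parsing, using a hypothetical `parse_preempt_ids` helper that works on a list of lines rather than the real `preempt.txt` (the IDs shown are invented):

```python
def parse_preempt_ids(lines):
    """Return tweet IDs from preempt-style lines, skipping blank and malformed entries."""
    ids = []
    for line in lines:
        tokens = line.split()
        if not tokens:  # blank or whitespace-only line
            continue
        try:
            ids.append(int(tokens[0]))
        except ValueError:
            # The first token is not an integer; skip this line.
            continue
    return ids


# Invented sample content, one entry per line as read from a text file.
sample = ["1783000000000000001\n", "\n", "1783000000000000002 some note\n", "oops\n"]
ids = parse_preempt_ids(sample)
print(ids)
```

Splitting before converting avoids relying on a broad exception handler to absorb blank lines.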
+40 -20
@@ -1,6 +1,7 @@
from os.path import exists
from time import sleep
from datetime import datetime, timedelta
import traceback
import pytz
@@ -14,8 +15,10 @@ from tweety_utils import *
from talenttweet import *
import talent_lists
# TODO: on RateLimit encounter, determine when it will probably
# unlock and wait just until then
class Scraper:
COOLDOWN = 16 # minutes
def __init__(self):
Scraper.instance = self
self.__account = AccountPool()
@@ -47,16 +50,16 @@ class Scraper:
def login_wait(self, private=False):
if private:
print(
f"keeping pvt-accessible account ({self.__account.use_index(0)[0]}). sleeping for 4 minutes..."
f"keeping pvt-accessible account ({self.__account.use_index(0)[0]}). sleeping for {Scraper.COOLDOWN} minutes..."
)
sleep(240)
sleep(60*Scraper.COOLDOWN)
print()
l = self.try_login(0)
else:
l = self.try_login()
if not l:
print("sleeping for 4 minutes...")
sleep(240)
print(f"sleeping for {Scraper.COOLDOWN} minutes...")
sleep(60*Scraper.COOLDOWN)
print()
self.try_login()
@@ -86,7 +89,7 @@ class Scraper:
return tweet
def get_tweet(self, id: int, private_user=False):
# print(f'{id}{" on private" if private_user else ""}')
# print(f'getting {id}{" on private" if private_user else ""}')
if private_user:
self.try_login(0)
while True:
@@ -96,19 +99,29 @@ class Scraper:
except RateLimitReached:
print("RateLimitReached occurred")
self.login_wait(private_user)
except UnknownError:
print("UnknownError occurred, probably rate-limited")
except UnknownError as e:
print(f"UnknownError occurred: {e.message.rstrip()}")
print(f"skipping attempt to get tweet {id}...")
return None
# if any(x in e.message.lower() for x in ["missing", "post is unavailable", "delete"]) : # tweet is probably unavailable
# print(f"tweet {id} seems unavailable; skipping...")
# return None
# if "account owner limits" in e.message.lower(): # private tweet
# print("trying again as pvt-accessible...\n")
# return self.get_tweet(id, True)
# print("treating like RateLimitReached and using the next scraper...")
# traceback.print_exc()
self.login_wait(private_user)
# self.login_wait(private_user)
except Exception as e:
if not private_user:
print("Unhandled exception occurred, trying again as private...")
return self.get_tweet(id, True)
else:
print(
f"Unhandled exception occurred, tweet {id} is probably unavailable"
)
print(e)
# if not private_user:
# print("Unhandled exception occurred getting tweet!")
# traceback.print_exc()
# print("trying again as pvt-accessible...\n")
# return self.get_tweet(id, True)
# else:
print("Unhandled exception occurred")
traceback.print_exc()
print(f"skipping tweet {id}")
return None
# since MUST BE TIMEZONE AWARE
@@ -126,7 +139,7 @@ class Scraper:
else:
print(f"grabbing tweets since {since.date()}")
uid = self.app._get_user_id(username)
uid = int(self.app._get_user_id(username))
print(f"{username} = {uid}")
def add_tweet(tweet: Tweet):
@@ -168,6 +181,8 @@ class Scraper:
for e in cur_page:
if isinstance(e, Tweet):
add_tweet(e)
if e == cur_page[-1]:
print(f"{e.date} (last tweet) < {since.date()} (since) ?")
elif isinstance(e, SelfThread):
# FIXME: rework when replied_to is fixed (currently populates user_mentions)
# latest tweet in thread = og author's reply
@@ -175,9 +190,14 @@ class Scraper:
add_tweet(t)
cur = search.cursor
except (UnknownError, RateLimitReached):
print("UnknownError occurred, probably rate-limited")
except RateLimitReached:
print("RateLimitReached occurred getting tweets from user")
self.login_wait(uid in talent_lists.privated_accounts)
except UnknownError as e:
print(f"UnknownError occurred getting tweets from user: {e.message.rstrip()}")
print("treating like RateLimitReached...")
self.login_wait(uid in talent_lists.privated_accounts)
sleep(5) # FIXME: temporary attempt to avoid scraper lock-up
tweets.sort(key=lambda t: t.id)
return tweets
+3 -1
@@ -311,9 +311,11 @@ class TalentTweet:
rtm_msg(QUOTED_TWEET_MENTIONS_B, quoted_username)
else:
ret += QUOTE_TWEET.format(author_username, quoted_username)
elif len(self.mentions) > 0: # standalone tweet
elif len(self.mentions) > 0: # standalone tweet that mentions other
ret += TWEET.format(author_username, ", ".join(mention_usernames))
mention_usernames.clear()
elif len(self.rt_mentions) > 0: # reply to non-talent tweet that mentions B
rtm_msg(REPLY_TO_MENTION_B, "")
else:
raise ValueError(
f"TalentTweet {self.tweet_id} has insufficient other parties"
+2 -2
@@ -73,9 +73,9 @@ class TalentTweetQueue:
for line in f:
if len(line) > 0:
ttweet = tt.TalentTweet.deserialize(line)
if ttweet.tweet_id in self.ttweets_dict:
self.ttweets_dict[ttweet.tweet_id] = ttweet
if ttweet.tweet_id not in self.ttweets_dict:
print(f"adding unfinished tweet {ttweet.tweet_id}")
self.ttweets_dict[ttweet.tweet_id] = ttweet
# finished ttweets
try:
with open(self.finished_ttweets_path, "r") as f:
+30 -28
@@ -130,6 +130,7 @@ class TwAPI:
async def get_ttweet_image_media_id(self, ttweet):
img = await util.create_ttweet_image(ttweet)
print(f"obtaining media id for {img}...")
media = self.api.media_upload(img)
return media.media_id
@@ -141,6 +142,7 @@ class TwAPI:
print(
f"------{ttweet.tweet_id} ({util.get_username_local(ttweet.author_id)})------"
)
print(ttweet)
text = ttweet.announce_text()
ttweet_url = ttweet.url()
@@ -151,41 +153,41 @@ class TwAPI:
return False
# main tweet: text + screenshot
try:
# try:
print("creating main QRT w/ screenshot...")
media_ids = [await self.get_ttweet_image_media_id(ttweet)]
twt_resp = await self.post_tweet(
text, media_ids=media_ids, quote_tweet_id=ttweet.tweet_id
)
print("done")
except:
print(
"error occurred trying to create main tweet, falling back to URL-main + reply screencap format"
)
traceback.print_exc()
try:
print("posting main tweet...")
twt_resp = await self.post_tweet(text, quote_tweet_id=ttweet.tweet_id)
print("done")
twt_id = twt_resp.data["id"]
# except:
# print(
# "error occurred trying to create main tweet, falling back to URL-main + reply screencap format"
# )
# traceback.print_exc()
# try:
# print("posting main tweet...")
# twt_resp = await self.post_tweet(text, quote_tweet_id=ttweet.tweet_id)
# print("done")
# twt_id = twt_resp.data["id"]
try:
print("creating reply img...", end="")
media_ids = [await self.get_ttweet_image_media_id(ttweet)]
print("posting reply tweet...", end="")
await self.post_tweet(reply_to_tweet=twt_id, media_ids=media_ids)
print("done")
except:
print("Had trouble posting reply image tweet.")
print("successfully posted ttweet!")
except tweepy.Forbidden as e:
if "duplicate content" in e.api_messages[0]:
print(
"Twitter says the TalentTweet is a duplicate; skipping error-free..."
)
return False
else:
raise e
# try:
# print("creating reply img...", end="")
# media_ids = [await self.get_ttweet_image_media_id(ttweet)]
# print("posting reply tweet...", end="")
# await self.post_tweet(reply_to_tweet=twt_id, media_ids=media_ids)
# print("done")
# except:
# print("Had trouble posting reply image tweet.")
# print("successfully posted ttweet!")
# except tweepy.Forbidden as e:
# if "duplicate content" in e.api_messages[0]:
# print(
# "Twitter says the TalentTweet is a duplicate; skipping error-free..."
# )
# return False
# else:
# raise e
return True
async def post_ttweet_by_id(self, id: int, dry_run=False):
+5 -2
@@ -83,6 +83,7 @@ async def create_ttweet_image(ttweet):
tc.driver_path = "/usr/bin/chromedriver"
filename = working_path(file="img.png")
img = None
print(f"Creating image for TalentTweet {ttweet.url()}")
try:
os.remove(filename)
except:
@@ -94,10 +95,12 @@ async def create_ttweet_image(ttweet):
mode=4,
night_mode=1,
show_parent_tweets=True,
#parent_tweets_limit=3
)
img = fix_aspect_ratio(img)
except:
print("unable to create tweet image")
except Exception as e:
print("ERROR: unable to create tweet image")
print(e)
traceback.print_exc()
return None