Update README.md

ignore privated tweets entirely (prob the reason why my scrapers get kicked)
add condition for unavailable tweet
20 changed files with 179 additions and 104 deletions
@@ -1 +1,2 @@
 ./run
+./.venv
@@ -17,8 +17,9 @@ RUN pip3 install --break-system-packages -r requirements.txt
 # Copy source code
 COPY . .

-# Mount working directory
+# Mount persistent working directory
 VOLUME ./run

 # Run the bot
-CMD ["python3", "src/main.py"]
+CMD ["python3", "-u", "src/main.py"]
+#CMD ["python3", "-u", "src/main.py", "--straight-to-queue"]
@@ -5,6 +5,8 @@ Twitter bot that tracks cross-company interactions between the non-JP branches o

 **This project was created to run [this account](https://twitter.com/NijiHolo_EN_ID).**

+![Screenshot_20230912_235245](https://github.com/muskit/muskit/assets/15199219/0359fb26-8a48-4698-9b78-66d7d852099e)
+
 ## Running
 With the way packages are setup, **you must have Docker installed and running!!**

@@ -12,11 +14,17 @@ Setup the `.env` in the project root. Refer to the [`.env`](#env) section for va

 Build and run the Docker container:
 ```bash
-# to run attached (can CTRL+P,CTRL+Q to detach)
-sh run.sh
+# to delete container and built image
+sh scripts/delete.sh

-# ... or to run headless
-sh run_detached.sh
+# to build image
+sh scripts/build.sh
+
+# to create container and run attached (can CTRL+P,CTRL+Q to detach)
+sh scripts/run.sh
+
+# ... or to run headless/detached
+sh scripts/run_detached.sh
 ```

 If attached to a container prepared by Dockerfile, you can run the program from project root (not in `src`). Refer to the following section for options.
@@ -36,17 +44,17 @@ These need to be defined in a `.env` file in the `run` ephemeral directory.
 ### Scraper Credentials
 To get around rate limitations imposed on users, we scrape with multiple accounts. Each account is defined in the file using the following format:
 ```
-scraper_usernameX=twitter_username
-scraper_passwordX=twitter_password
+scraperX_username=twitter_username
+scraperX_password=twitter_auth_token
 ```
 where `X` is a number starting from 0, increasing by 1 for each account added. For instance:
 ```
-scraper_username0=
-scraper_password0=
-scraper_username1=
-scraper_password1=
+scraper0_username=
+scraper0_password=
+scraper1_username=
+scraper1_password=
 ```
-The first account (`scraper_username0` and `scraper_password0`) **MUST be defined (`scraper_username` and `scraper_password` without number will not work!)**  and will be used to attempt scraping private accounts. Make sure this account follows any private accounts that you want to scrape!
+The first account (`scraper0_username` and `scraper0_password`) **MUST be defined (`scraper_username` and `scraper_password` without number will not work!)**  and will be used to attempt scraping private accounts. Make sure this account follows any private accounts that you want to scrape!
 ### Twitter API Stuff
 The following keys/tokens are used for the official API via `tweepy`. We mainly use these to just post tweets.
 ```
@@ -56,20 +64,16 @@ user_token=
 user_secret=
 ```
 ### Screenshot Cookie *(optional)*
-This is the authentication token obtained from a browser when signed in on the Twitter website. It's only needed if you want to screenshot tweets from privated accounts. Make sure the token belongs to an account that follows desired private accounts! Maybe have it belong to `scraper_username0`?
+This is the authentication token obtained from a browser when signed in on the Twitter website. It's only needed if you want to screenshot tweets from privated accounts. Make sure the token belongs to an account that follows desired private accounts! Maybe have it belong to `scraper0`?
 ```
 web_auth_token=
 ```
 ### Example `.env` without values
 ```
-scraper_username0=
-scraper_password0=
-scraper_username1=
-scraper_password1=
-scraper_username2=
-scraper_password2=
-scraper_username3=
-scraper_password3=
+scraper0_username=
+scraper0_password=
+scraper1_username=
+scraper1_password=
 web_auth_token=
 app_key=
 app_secret=
@@ -1,3 +0,0 @@
-#!/bin/sh
-
-docker build -t nijiholo_bot .
@@ -1,6 +1,6 @@
 python-dotenv
 nest-asyncio
 pytz
-git+https://github.com/mahrtayyab/tweety.git@e3d330280cb3b2e8f9d2bf2f20425c476f7671a5
+git+https://github.com/mahrtayyab/tweety.git
 tweepy
 tweet-capture
@@ -1,7 +1,7 @@
 python-dotenv
 nest-asyncio
 pytz
-git+https://github.com/mahrtayyab/tweety.git@e3d330280cb3b2e8f9d2bf2f20425c476f7671a5
+git+https://github.com/mahrtayyab/tweety.git
 tweepy
 tweet-capture
 opencv-python-headless
@@ -1,4 +0,0 @@
-#!/bin/sh
-
-mkdir -p run
-docker run -v ./run:/app/run --name bot -it nijiholo_bot
@@ -1,4 +0,0 @@
-#!/bin/sh
-
-mkdir -p run
-docker run -v ./run:/app/run --name bot -d nijiholo_bot
@@ -0,0 +1,6 @@
+#!/bin/sh
+
+CURPATH="$(dirname `realpath "$0"`)/.."
+#sudo docker build -t nijiholo_bot --no-cache "$CURPATH"
+sudo docker build -t nijiholo_bot "$CURPATH"
+sudo docker container create -v "$CURPATH/run:/app/run" --name bot nijiholo_bot
@@ -0,0 +1,4 @@
+#!/bin/sh
+
+sudo docker container rm bot
+sudo docker image rm nijiholo_bot
@@ -0,0 +1,6 @@
+#!/bin/sh
+CURPATH="$(dirname `realpath "$0"`)/.."
+cd "$CURPATH"
+mkdir -p run
+#sudo docker run -v "$CURPATH/run:/app/run" --name bot -it nijiholo_bot
+sudo docker container start -a -i bot
@@ -0,0 +1,6 @@
+#!/bin/sh
+CURPATH="$(dirname `realpath "$0"`)/.."
+cd "$CURPATH"
+mkdir -p run
+#sudo docker run -v "$CURPATH/run:/app/run" --name bot -d nijiholo_bot
+sudo docker container start bot
@@ -0,0 +1,2 @@
+#!/bin/sh
+sudo docker exec -it bot sh
@@ -11,13 +11,14 @@ class AccountPool:
        creds = dotenv_values(working_path(file=".env"))
        i = 0
        while True:
-            if f"scraper_username{i}" in creds and f"scraper_password{i}" in creds:
+            if f"scraper{i}_username" in creds and f"scraper{i}_password" in creds:
                self.__accounts.append(
-                    (creds[f"scraper_username{i}"], creds[f"scraper_password{i}"])
+                    (creds[f"scraper{i}_username"], creds[f"scraper{i}_password"])
                )
                i += 1
            else:
                break
+        print(f"{len(self.__accounts)} scraper credentials found!")

    def use_index(self, idx):
        self.__idx = idx
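
The credential lookup in the `AccountPool` hunk above can be sketched on its own, outside the class. This is a minimal illustration that assumes the `.env` values have already been loaded into a plain mapping; the function name `collect_scraper_accounts` is hypothetical and does not appear in the repository.

```python
from typing import List, Mapping, Tuple


def collect_scraper_accounts(creds: Mapping[str, str]) -> List[Tuple[str, str]]:
    """Collect (username, password) pairs for scraper0, scraper1, ... in order.

    Enumeration stops at the first index where either key is missing,
    so a gap in the numbering hides every account after it.
    """
    accounts: List[Tuple[str, str]] = []
    i = 0
    while f"scraper{i}_username" in creds and f"scraper{i}_password" in creds:
        accounts.append((creds[f"scraper{i}_username"], creds[f"scraper{i}_password"]))
        i += 1
    return accounts


if __name__ == "__main__":
    example = {
        "scraper0_username": "a", "scraper0_password": "x",
        "scraper1_username": "b", "scraper1_password": "y",
        # scraper2 is missing, so scraper3 is never reached
        "scraper3_username": "d", "scraper3_password": "z",
    }
    print(f"{len(collect_scraper_accounts(example))} scraper credentials found!")
```

Because the loop stops at the first missing index, account numbering in `.env` should be contiguous starting from 0.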
@@ -16,8 +16,9 @@ import ttweetqueue as ttq

 PROGRAM_ARGS = None

+preempt_done = False
 safe_to_post_tweets = True
-scraper: Scraper
+scraper = Scraper()


 # Updates TTweetQueue
@@ -87,7 +88,7 @@ async def process_queue() -> bool:

    queued_ttweets_count = queue.get_count()

-    WAIT_TIME = 60 * 15
+    WAIT_TIME = 60 * 30 # 30 minutes
    ttweets_posted = 0

    if queued_ttweets_count == 0:
@@ -110,7 +111,7 @@ async def process_queue() -> bool:
                ttweets_posted += 1
                print(f"({ttweets_posted}/{queued_ttweets_count}) done")
                if not queue.is_empty():
-                    print(f"resting for {WAIT_TIME}s...")
+                    print(f"resting for {WAIT_TIME // 60} minutes...")
                    await asyncio.sleep(WAIT_TIME - 5)
                    print("5 second warning!")
                    await asyncio.sleep(5)
@@ -127,13 +128,13 @@ async def process_queue() -> bool:
 # return False = issue occurred where we couldn't post all past tweets properly
 async def run(PROGRAM_ARGS):
    global safe_to_post_tweets
+    global preempt_done
    global scraper
    global queue

-    scraper = Scraper()
    queue = ttq.TalentTweetQueue.instance

-    # post tweets given in command line first
+    # OPTION: post tweets given in command line first
    if PROGRAM_ARGS.post_id is not None and len(PROGRAM_ARGS.post_id) > 0:
        PROGRAM_ARGS.post_id.sort()
        print("Posting specified tweets first.")
@@ -150,11 +151,38 @@ async def run(PROGRAM_ARGS):
                print("Successfully posted tweet. Sleeping for 5 minutes")
                await asyncio.sleep(60 * 5)
            else:
-                print("Did not post tweet")
+                print("Did not post tweet\n")
        print("Done processing specified tweets")
        PROGRAM_ARGS.post_id = None

-    # refresh stored queue first
+    # PREEMPT: post tweet IDs in preempt.txt if exists and not empty
+    if not preempt_done:
+        try:
+            with open(working_path(file="preempt.txt"), "r") as preempt_file:
+                print("Found preempt.txt! Posting stored IDs unconditionally...")
+
+                for l in preempt_file:
+                    if not l.strip(): continue  # skip blank lines (they still contain "\n")
+                    try:
+                        id = int(l.strip().split()[0])
+                    except ValueError:
+                        print(f"Error occurred processing {l.rstrip()}, skipping...")
+                        continue
+
+                    posted = await TwAPI.instance.post_ttweet_by_id(id, PROGRAM_ARGS.dry_run)
+                    if posted:
+                        queue.add_finished_tweet(id)
+                        print("Successfully posted tweet. Sleeping for 5 minutes")
+                        await asyncio.sleep(60 * 5)
+                    else:
+                        print("Could not post tweet\n")
+
+                print("Finished processing preempt.txt")
+                preempt_done = True
+        except FileNotFoundError:
+            print("preempt.txt wasn't found")
+
+    # OPTION: refresh stored queue first
    if PROGRAM_ARGS.refresh_queue:
        PROGRAM_ARGS.refresh_queue = False
        print("Refreshing queue tweets...")
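
The `preempt.txt` parsing in the hunk above reads one tweet ID per line and takes the first whitespace-separated field. A minimal standalone sketch of just that parsing step follows; the helper name `parse_preempt_ids` is hypothetical, and the posting and sleeping logic is deliberately left out.

```python
from typing import Iterable, List


def parse_preempt_ids(lines: Iterable[str]) -> List[int]:
    """Return the integer ID at the start of each line, in file order.

    Blank lines are skipped silently; lines whose first field is not an
    integer are reported and skipped.
    """
    ids: List[int] = []
    for line in lines:
        fields = line.split()
        if not fields:  # blank or whitespace-only line
            continue
        try:
            ids.append(int(fields[0]))
        except ValueError:
            print(f"skipping malformed line: {line.rstrip()!r}")
    return ids


if __name__ == "__main__":
    sample = ["123 first tweet\n", "\n", "not-an-id\n", "  456\n"]
    print(parse_preempt_ids(sample))
```

Anything after the first field is ignored, so a line may carry a trailing note next to its ID.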
@@ -1,6 +1,7 @@
 from os.path import exists
 from time import sleep
 from datetime import datetime, timedelta
+import traceback

 import pytz

@@ -14,8 +15,10 @@ from tweety_utils import *
 from talenttweet import *
 import talent_lists

-
+# TODO: on RateLimit encounter, determine when it will probably
+# unlock and wait just until then
 class Scraper:
+    COOLDOWN = 16 # minutes
    def __init__(self):
        Scraper.instance = self
        self.__account = AccountPool()
@@ -47,16 +50,16 @@ class Scraper:
    def login_wait(self, private=False):
        if private:
            print(
-                f"keeping pvt-accessible account ({self.__account.use_index(0)[0]}). sleeping for 4 minutes..."
+                f"keeping pvt-accessible account ({self.__account.use_index(0)[0]}). sleeping for {Scraper.COOLDOWN} minutes..."
            )
-            sleep(240)
+            sleep(60*Scraper.COOLDOWN)
            print()
            l = self.try_login(0)
        else:
            l = self.try_login()
        if not l:
-            print("sleeping for 4 minutes...")
-            sleep(240)
+            print(f"sleeping for {Scraper.COOLDOWN} minutes...")
+            sleep(60*Scraper.COOLDOWN)
            print()
            self.try_login()

@@ -86,7 +89,7 @@ class Scraper:
        return tweet

    def get_tweet(self, id: int, private_user=False):
-        # print(f'{id}{" on private" if private_user else ""}')
+        # print(f'getting {id}{" on private" if private_user else ""}')
        if private_user:
            self.try_login(0)
        while True:
@@ -96,19 +99,29 @@ class Scraper:
            except RateLimitReached:
                print("RateLimitReached occurred")
                self.login_wait(private_user)
-            except UnknownError:
-                print("UnknownError occurred, probably rate-limited")
+            except UnknownError as e:
+                print(f"UnknownError occurred: {e.message.rstrip()}")
+                print(f"skipping attempt to get tweet {id}...")
+                return None
+                # if any(x in e.message.lower() for x in ["missing", "post is unavailable", "delete"]) :  # tweet is probably unavailable
+                #     print(f"tweet {id} seems unavailable; skipping...")
+                #     return None
+                # if "account owner limits" in e.message.lower(): # private tweet
+                #     print("trying again as pvt-accessible...\n")
+                #     return self.get_tweet(id, True)
+                # print("treating like RateLimitReached and using the next scraper...")
                # traceback.print_exc()
-                self.login_wait(private_user)
+                # self.login_wait(private_user)
            except Exception as e:
-                if not private_user:
-                    print("Unhandled exception occurred, trying again as private...")
-                    return self.get_tweet(id, True)
-                else:
-                    print(
-                        f"Unhandled exception occurred, tweet {id} is probably unavailable"
-                    )
-                    print(e)
+                # if not private_user:
+                #     print("Unhandled exception occurred getting tweet!")
+                #     traceback.print_exc()
+                #     print("trying again as pvt-accessible...\n")
+                #     return self.get_tweet(id, True)
+                # else:
+                    print("Unhandled exception occurred")
+                    traceback.print_exc()
+                    print(f"skipping tweet {id}")
                    return None

    # since MUST BE TIMEZONE AWARE
@@ -126,7 +139,7 @@ class Scraper:
        else:
            print(f"grabbing tweets since {since.date()}")

-        uid = self.app._get_user_id(username)
+        uid = int(self.app._get_user_id(username))
        print(f"{username} = {uid}")

        def add_tweet(tweet: Tweet):
@@ -168,6 +181,8 @@ class Scraper:
                for e in cur_page:
                    if isinstance(e, Tweet):
                        add_tweet(e)
+                        if e == cur_page[-1]:
+                            print(f"{e.date} (last tweet) < {since.date()} (since) ?")
                    elif isinstance(e, SelfThread):
                        # FIXME: rework when replied_to is fixed (currently populates user_mentions)
                        # latest tweet in thread = og author's reply
@@ -175,9 +190,14 @@ class Scraper:
                            add_tweet(t)

                cur = search.cursor
-            except (UnknownError, RateLimitReached):
-                print("UnknownError occurred, probably rate-limited")
+            except RateLimitReached:
+                print("RateLimitReached occurred getting tweets from user")
                self.login_wait(uid in talent_lists.privated_accounts)
+            except UnknownError as e:
+                print(f"UnknownError occurred getting tweets from user: {e.message.rstrip()}")
+                print("treating like RateLimitReached...")
+                self.login_wait(uid in talent_lists.privated_accounts)
+            sleep(5) # FIXME: temporary attempt to avoid scraper lock-up

        tweets.sort(key=lambda t: t.id)
        return tweets
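
The scraper hunks above separate two failure modes: a rate limit, which triggers a cooldown and a retry, and an error that suggests the tweet cannot be fetched, which skips it. The sketch below shows that control flow in isolation. It does not use the `tweety` API: `RateLimited` and `Unavailable` are stand-in exception classes, `fetch_with_cooldown` is a hypothetical helper, and the `max_waits` cap is an addition not present in the diff, where the loop retries indefinitely.

```python
import time
from typing import Callable, Optional


class RateLimited(Exception):
    """Stand-in for a rate-limit error raised by a scraper (hypothetical)."""


class Unavailable(Exception):
    """Stand-in for an error meaning the item cannot be fetched (hypothetical)."""


def fetch_with_cooldown(
    fetch: Callable[[], str],
    cooldown_s: float,
    max_waits: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(), sleeping cooldown_s after each rate limit before retrying.

    Returns None if the item is unavailable or if the rate limit persists
    after max_waits cooldowns.
    """
    waits = 0
    while True:
        try:
            return fetch()
        except RateLimited:
            if waits >= max_waits:
                return None
            waits += 1
            sleep(cooldown_s)
        except Unavailable:
            return None  # skip instead of retrying
```

Passing `sleep` as a parameter lets the cooldown be replaced with a stub when the loop is exercised without waiting.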
@@ -311,9 +311,11 @@ class TalentTweet:
                rtm_msg(QUOTED_TWEET_MENTIONS_B, quoted_username)
            else:
                ret += QUOTE_TWEET.format(author_username, quoted_username)
-        elif len(self.mentions) > 0:  # standalone tweet
+        elif len(self.mentions) > 0:  # standalone tweet that mentions other
            ret += TWEET.format(author_username, ", ".join(mention_usernames))
            mention_usernames.clear()
+        elif len(self.rt_mentions) > 0: # reply to non-talent tweet that mentions B
+            rtm_msg(REPLY_TO_MENTION_B, "")
        else:
            raise ValueError(
                f"TalentTweet {self.tweet_id} has insufficient other parties"
@@ -73,9 +73,9 @@ class TalentTweetQueue:
                for line in f:
                    if len(line) > 0:
                        ttweet = tt.TalentTweet.deserialize(line)
-                        if ttweet.tweet_id in self.ttweets_dict:
-                            self.ttweets_dict[ttweet.tweet_id] = ttweet
+                        if ttweet.tweet_id not in self.ttweets_dict:
                            print(f"adding unfinished tweet {ttweet.tweet_id}")
+                            self.ttweets_dict[ttweet.tweet_id] = ttweet
        # finished ttweets
        try:
            with open(self.finished_ttweets_path, "r") as f:
@@ -130,6 +130,7 @@ class TwAPI:

    async def get_ttweet_image_media_id(self, ttweet):
        img = await util.create_ttweet_image(ttweet)
+        print(f"obtaining media id for {img}...")
        media = self.api.media_upload(img)
        return media.media_id

@@ -141,6 +142,7 @@ class TwAPI:
        print(
            f"------{ttweet.tweet_id} ({util.get_username_local(ttweet.author_id)})------"
        )
+        print(ttweet)

        text = ttweet.announce_text()
        ttweet_url = ttweet.url()
@@ -151,41 +153,41 @@ class TwAPI:
            return False

        # main tweet: text + screenshot
-        try:
+        # try:
        print("creating main QRT w/ screenshot...")
        media_ids = [await self.get_ttweet_image_media_id(ttweet)]
        twt_resp = await self.post_tweet(
            text, media_ids=media_ids, quote_tweet_id=ttweet.tweet_id
        )
        print("done")
-        except:
-            print(
-                "error occurred trying to create main tweet, falling back to URL-main + reply screencap format"
-            )
-            traceback.print_exc()
-            try:
-                print("posting main tweet...")
-                twt_resp = await self.post_tweet(text, quote_tweet_id=ttweet.tweet_id)
-                print("done")
-                twt_id = twt_resp.data["id"]
+        # except:
+        #     print(
+        #         "error occurred trying to create main tweet, falling back to URL-main + reply screencap format"
+        #     )
+        #     traceback.print_exc()
+        #     try:
+        #         print("posting main tweet...")
+        #         twt_resp = await self.post_tweet(text, quote_tweet_id=ttweet.tweet_id)
+        #         print("done")
+        #         twt_id = twt_resp.data["id"]

-                try:
-                    print("creating reply img...", end="")
-                    media_ids = [await self.get_ttweet_image_media_id(ttweet)]
-                    print("posting reply tweet...", end="")
-                    await self.post_tweet(reply_to_tweet=twt_id, media_ids=media_ids)
-                    print("done")
-                except:
-                    print("Had trouble posting reply image tweet.")
-                print("successfully posted ttweet!")
-            except tweepy.Forbidden as e:
-                if "duplicate content" in e.api_messages[0]:
-                    print(
-                        "Twitter says the TalentTweet is a duplicate; skipping error-free..."
-                    )
-                    return False
-                else:
-                    raise e
+        #         try:
+        #             print("creating reply img...", end="")
+        #             media_ids = [await self.get_ttweet_image_media_id(ttweet)]
+        #             print("posting reply tweet...", end="")
+        #             await self.post_tweet(reply_to_tweet=twt_id, media_ids=media_ids)
+        #             print("done")
+        #         except:
+        #             print("Had trouble posting reply image tweet.")
+        #         print("successfully posted ttweet!")
+        #     except tweepy.Forbidden as e:
+        #         if "duplicate content" in e.api_messages[0]:
+        #             print(
+        #                 "Twitter says the TalentTweet is a duplicate; skipping error-free..."
+        #             )
+        #             return False
+        #         else:
+        #             raise e
        return True

    async def post_ttweet_by_id(self, id: int, dry_run=False):
@@ -83,6 +83,7 @@ async def create_ttweet_image(ttweet):
        tc.driver_path = "/usr/bin/chromedriver"
    filename = working_path(file="img.png")
    img = None
+    print(f"Creating image for TalentTweet {ttweet.url()}")
    try:
        os.remove(filename)
    except:
@@ -94,10 +95,12 @@ async def create_ttweet_image(ttweet):
            mode=4,
            night_mode=1,
            show_parent_tweets=True,
+            #parent_tweets_limit=3
        )
        img = fix_aspect_ratio(img)
-    except:
-        print("unable to create tweet image")
+    except Exception as e:
+        print("ERROR: unable to create tweet image")
+        print(e)
        traceback.print_exc()
        return None
Author	SHA1	Message	Date
Alex	387237d017	Update README.md	2025-01-15 02:40:51 -08:00
muskit	cb921ee911	ignore privated tweets entirely (prob the reason why my scrapers get kicked)	2024-04-29 23:55:29 -07:00
muskit	85b4bfe939	add condition for unavailable tweet	2024-04-29 17:20:23 -07:00
muskit	661c9232a3	switch to latest tweety no matter what ¯\_(ツ)_/¯	2024-04-29 17:20:08 -07:00
muskit	8ba0394e9c	add no-cache build toggle line	2024-04-29 17:13:39 -07:00
muskit	6b585ad96a	print ttweet object when posting	2024-04-26 00:16:16 -07:00
muskit	eadf130305	fix logic error of handling finished tweets	2024-04-26 00:15:45 -07:00
muskit	24877eb53e	add cross-agency case	2024-04-26 00:15:32 -07:00
muskit	8e163e26cf	extend wait to try and avoid revocation (didnt work)	2024-04-26 00:14:46 -07:00
muskit	f035896226	add script to access shell in container	2024-04-26 00:13:35 -07:00
muskit	034c71abbb	use unbuffered stdout to fix docker logging	2024-04-26 00:13:24 -07:00
muskit	bfc9066617	remove backup tweet format that excludes screenshot	2024-04-21 16:05:37 -07:00
muskit	0203578987	add preempt tweet list handling	2024-04-21 16:04:33 -07:00
muskit	99155fdb37	add .venv to avoid long image rebuilds	2024-04-21 16:03:32 -07:00
muskit	d05da5bff0	add screenshots parent tweet limit	2024-04-20 12:28:39 -07:00
muskit	61a19e3fe2	revise scripts	2024-04-20 12:28:22 -07:00
muskit	d5d8db272f	update run.sh	2024-04-20 01:41:54 -07:00
muskit	3d96e8532a	update requirements_dev.txt	2024-04-20 01:41:42 -07:00
muskit	7fc86543e6	temporary patch for scraper lock outs	2024-04-20 01:41:27 -07:00
muskit	bacc426a6d	go back to passwords, use diff format	2024-04-19 23:30:28 -07:00
muskit	18dfb0a7c9	update tweety	2024-04-19 23:25:16 -07:00
muskit	1ca9fce722	move scripts into its own folder	2024-04-19 23:05:13 -07:00
muskit	f1eace1f63	update scraper.py	2024-03-18 00:53:40 -07:00
muskit	608e712bce	update scraper.py	2024-03-18 00:43:42 -07:00
muskit	805a1355fa	update scraper.py	2024-03-18 00:42:47 -07:00
muskit	8f0825ec2e	distinguish RateLimitReached from UnknownError	2024-03-18 00:21:51 -07:00
muskit	f9a9e47f7d	update scraper.py	2024-03-18 00:05:46 -07:00
muskit	3eded34c0f	update scraper.py	2024-03-18 00:02:21 -07:00
muskit	af7ce7150b	docs: update README.md	2024-03-17 23:52:00 -07:00
muskit	c88ddc749a	try using auth tokens on scraper	2024-03-17 23:50:17 -07:00
muskit	95f654316b	update tweety commit	2024-03-17 22:07:13 -07:00