We’re going to use LangChain + Deep Lake with GPT-4 to analyze the Twitter code base!

Feel free to copy my Google Colab notebook!
You’ve probably heard of Twitter, one of the world’s largest social media platforms. Its algorithm is a key component of Twitter’s functionality, influencing what content users see on their timelines and which tweets gain popularity. We will use LangChain, Deep Lake, and GPT-4 to analyze the source code of that algorithm.
LangChain is a Python library that provides a framework for working with LLMs on tasks such as text classification, text generation, and text summarization. Deep Lake, on the other hand, is a multi-modal vector store that is used to store and retrieve high-dimensional data such as embeddings.
Plan
We will start by indexing the code base of the Twitter algorithm: clone the repository, parse and chunk the code base, then compute OpenAI embeddings and index them with Deep Lake. Computing the embeddings and uploading them to Activeloop may take up to 4 minutes.
After indexing, we will use the indexed dataset to perform question answering on the Twitter algorithm codebase. We will construct a retriever and a Conversational Chain. The retriever is a search engine that retrieves relevant information from the indexed dataset based on a given query. The Conversational Chain is a tool for generating natural language responses to a given query. We will use GPT-4 as our Conversational Chain model.
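To make the plan concrete, here is a toy, pure-Python sketch of the retrieve-then-answer flow. The hand-made three-dimensional vectors stand in for OpenAI embeddings, a sorted cosine-similarity scan stands in for Deep Lake, and the chunk texts are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    # cosine similarity between two vectors: dot product over norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# tiny "indexed dataset": (chunk text, embedding vector)
corpus = [
    ("def rank(tweet): ...", [1.0, 0.1, 0.0]),
    ("README: build instructions", [0.0, 1.0, 0.2]),
    ("class HeavyRanker: ...", [0.9, 0.2, 0.1]),
]

def retrieve(query_vec, k=2):
    # the retriever: score every chunk against the query, keep the top k
    scored = sorted(corpus, key=lambda c: cosine_similarity(query_vec, c[1]), reverse=True)
    return [text for text, _ in scored[:k]]

query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "how does ranking work?"
context = retrieve(query_vec)

# the chain then stuffs the retrieved chunks into the model's prompt
prompt = "Answer using this context:\n" + "\n".join(context)
print(context)
```

In the real pipeline the embeddings come from OpenAI, the scan happens inside Deep Lake, and GPT-4 consumes the assembled prompt.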
To start, please visit Google Colab and open a new notebook!
!python3 -m pip install --upgrade langchain deeplake openai tiktoken
Define the OpenAI embeddings and authenticate with the Deep Lake multi-modal vector store API.
import os
import getpass
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass('Activeloop Token:')
embeddings = OpenAIEmbeddings()
Index Code Base (Optional)
!git clone https://github.com/twitter/the-algorithm # replace with any repository of your choice
Load all files inside the repo!
import os
from langchain.document_loaders import TextLoader
root_dir = './the-algorithm'
docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        try:
            # load each file as UTF-8 text and split it into documents
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception as e:
            pass
Chunk Files
from langchain.text_splitter import CharacterTextSplitter

# split the loaded documents into ~1000-character chunks for embedding
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
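Under the hood, chunking is conceptually simple. Here is a minimal, pure-Python sketch of fixed-size character chunking with optional overlap (a simplification of what the splitter actually does, since the real splitter prefers to cut on separators rather than at arbitrary character positions):

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    # naive character-based chunking: step through the text in windows
    # of chunk_size, backing up by chunk_overlap between windows
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

source = "x" * 2500
chunks = chunk_text(source, chunk_size=1000)
print(len(chunks))  # 3 chunks: 1000 + 1000 + 500 characters
```

Overlap keeps context that straddles a boundary visible in both neighboring chunks, at the cost of some duplicated embeddings.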
Execute the indexing!
db = DeepLake.from_documents(texts, embeddings, dataset_path="hub://davitbun/twitter-algorithm")
LET’S HAVE SOME FUN: Q&A on the Twitter Algorithm Codebase
First load the dataset, construct the retriever, then construct the Conversational Chain.
db = DeepLake(dataset_path="hub://davitbun/twitter-algorithm", read_only=True, embedding_function=embeddings)
This dataset can be visualized in a Jupyter notebook with db.visualize() or at https://app.activeloop.ai/davitbun/twitter-algorithm
You should see something like the following:

retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['fetch_k'] = 100
retriever.search_kwargs['maximal_marginal_relevance'] = True
retriever.search_kwargs['k'] = 20
You can also specify user-defined filtering functions with Deep Lake like so:
def filter(x):
    # filter based on source code
    if 'com.google' in x['text'].data()['value']:
        return False
    # filter based on path e.g. file extension
    metadata = x['metadata'].data()['value']
    return 'scala' in metadata['source'] or 'py' in metadata['source']
### turn on below for custom filtering
# retriever.search_kwargs['filter'] = filter
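To see what the filter actually receives, you can approximate a Deep Lake sample with a tiny stub. FakeTensor and the sample dicts below are illustrative stand-ins, not the real API; the real object exposes .data()['value'] the same way:

```python
class FakeTensor:
    # minimal stand-in for a Deep Lake tensor (illustration only)
    def __init__(self, value):
        self._value = value

    def data(self):
        return {'value': self._value}

def filter(x):
    # same logic as above: drop vendored Google code, keep Scala/Python files
    if 'com.google' in x['text'].data()['value']:
        return False
    metadata = x['metadata'].data()['value']
    return 'scala' in metadata['source'] or 'py' in metadata['source']

scala_file = {
    'text': FakeTensor('package com.twitter.ranker'),
    'metadata': FakeTensor({'source': 'the-algorithm/ranker.scala'}),
}
vendored = {
    'text': FakeTensor('import com.google.common.collect'),
    'metadata': FakeTensor({'source': 'the-algorithm/util.scala'}),
}
print(filter(scala_file), filter(vendored))  # True False
```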
If you have access to GPT-4, use it! If not, use gpt-3.5-turbo.
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
model = ChatOpenAI(model='gpt-4')  # or 'gpt-3.5-turbo'
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
Questions to ask the Twitter algorithm! You can modify these or ask any question you see fit 🙂
[ FYI: you may need to remove 2–4 questions when using gpt-3.5-turbo, since its context window is smaller ]
questions = [
"is it Likes + Bookmarks, or not clear from the code?",
"What are the major negative modifiers that lower your linear ranking parameters?",
"How do you get assigned to SimClusters?",
"What is needed to migrate from one SimClusters to another SimClusters?",
"How much do I get boosted within my cluster?",
"How does the Heavy ranker work? What are its main inputs?",
"How can one influence Heavy ranker?",
"Why do threads and long tweets do so well on the platform?",
"Are thread and long tweet creators building a following that reacts to only threads?",
"Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?",
"Content meta data and how it impacts virality (e.g. ALT in images).",
"What are some unexpected fingerprints for spam factors?",
"Is there any difference between company verified checkmarks and blue verified individual checkmarks?",
]
chat_history = []
for question in questions:
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result['answer']))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")
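Why might gpt-3.5-turbo need fewer questions? At the time of writing its context window is 4,096 tokens versus 8,192 for GPT-4, and the chat history grows with every answer. Here is a rough, pure-Python sketch for checking whether the accumulated history still fits; the ~4-characters-per-token heuristic is approximate (tiktoken gives exact counts), and the reserve budget is an assumed placeholder:

```python
def rough_token_count(text):
    # crude heuristic: ~4 characters per token for English text;
    # use tiktoken for exact counts in practice
    return max(1, len(text) // 4)

def fits_context(chat_history, question, limit=4096, reserve=1024):
    # leave `reserve` tokens for retrieved context and the model's answer
    used = rough_token_count(question)
    for q, a in chat_history:
        used += rough_token_count(q) + rough_token_count(a)
    return used <= limit - reserve

history = [("short question?", "short answer.")]
print(fits_context(history, "another question?"))  # True for a tiny history
```

When fits_context turns False, drop questions or truncate the oldest history entries before calling the chain again.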
Conclusion
This is awesome, and I would love to put a UI on this to let others play around with any GitHub repo or codebase!
LET’S CONNECT!!
Have a great week everyone!
First YouTube video published! YouTube Link
Follow me on Twitter and LinkedIn, and bookmark AIapplicationsblog.com
References:
- Google Colab Notebook
- LangChain
- Activeloop
- Twitter GitHub repo
- LangChain Docs