Simplify Law: Using RAGs to Unleash Legal Discovery
Diving into the rich tapestry of India's legislative history, we can see a clear vision that our leaders have for our nation. In an attempt to promote transparency, this project aims to pave the way for LLMs to answer questions on Indian law.
Law and code. Both create order from chaos and guide their systems to useful outcomes.
Update: As of 24th December 2023, the Scrapy code no longer works since the site's robots.txt now disallows scraping. We'll have to find a better way to get this data.
This blog is divided into two parts. The first deals with a simple(!) scraping job using Scrapy. The second deals with using Retrieval-Augmented Generation to query our legal data and answer questions posed by our user.
I. Scraping legal document stores
We've used Scrapy to retrieve document URLs for laws, acts, and other regulatory documents. While libraries like BeautifulSoup and Selenium can be leveraged to achieve similar ends, the benefits offered by Scrapy have made it indispensable for information extraction tasks on the internet. Some of these are:
- Scraping Speed and Parallelization: Being asynchronous, Scrapy supports parallelization by default, making it the clear library of choice for speedy extraction.
- Memory usage: Scrapy uses fewer system resources, leading to lower memory utilisation.
- Support for Extensions and Middleware: Scrapy supports numerous middleware and extensions, making it possible to write a more robust and efficient crawler (see the settings sketch after this list).
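As a taste of that last point, a handful of standard settings in a project's settings.py turn on Scrapy's built-in politeness machinery. These are real Scrapy settings, but the values below are just illustrative starting points:

# settings.py -- standard Scrapy settings; values are illustrative
ROBOTSTXT_OBEY = True           # respect the target site's robots.txt
CONCURRENT_REQUESTS = 16        # how many requests Scrapy keeps in flight
AUTOTHROTTLE_ENABLED = True     # back off automatically when the server slows down
AUTOTHROTTLE_START_DELAY = 1.0  # initial download delay, in seconds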
Installation and Setup
Documentation is your friend! Installing Scrapy is pretty straightforward. Scrapy requires Python 3.8+ and can be installed from PyPI with the command below:
pip install Scrapy
A better approach is to install Scrapy in a dedicated virtualenv inside a project directory created for this purpose.
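For instance, on Linux/macOS:

python -m venv scrapy-env
source scrapy-env/bin/activate
pip install Scrapy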
Voila! We're now ready to create our Scrapy project and extract links to the documents recording all acts passed by the legislative assemblies of India.
Creating a Project
Navigate to the directory where you'd like to create your scrapy project. Now open the command line terminal and type in:
scrapy startproject india_code
This will create a directory structure that resembles:
india_code/
    scrapy.cfg            # deploy configuration file
    india_code/           # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Spiders
Why spiders? Why couldn't it have been follow the butterflies?
Ron Weasley
Directly quoting from the documentation:
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
We can define our spider in the spiders directory under a new file, central_acts.py.
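The source site's actual markup isn't reproduced here, so treat the start URL and CSS selectors in this sketch as illustrative assumptions rather than the real page structure:

import scrapy

class CentralActsSpider(scrapy.Spider):
    name = 'central_acts'
    # Hypothetical listing page for central acts; substitute the real one
    start_urls = ['https://www.indiacode.nic.in/central-acts']

    def parse(self, response):
        # Yield one record per act: its title and a link to the document
        for link in response.css('td.title a'):  # hypothetical selector
            yield {
                'title': link.css('::text').get(),
                'url': response.urljoin(link.attrib['href']),
            }
        # Follow pagination if the listing spans multiple pages
        next_page = response.css('a.next::attr(href)').get()  # hypothetical selector
        if next_page:
            yield response.follow(next_page, callback=self.parse)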
Alternatively, we can simplify running our spider via the following script, named crawl_script.py, in the top-level directory:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from india_code.spiders.central_acts import CentralActsSpider as SpiderCA

def run_spiders():
    # Run the spider with the project's settings.py applied
    process = CrawlerProcess(get_project_settings())
    process.crawl(SpiderCA)
    process.start()

if __name__ == "__main__":
    run_spiders()
Run it using:
python crawl_script.py
Now, to scrape state laws, we add a second spider in spiders/state_acts.py.
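As before, the listing URL and selectors below are illustrative assumptions; the sketch simply mirrors CentralActsSpider:

import scrapy

class StateActsSpider(scrapy.Spider):
    name = 'state_acts'
    # Hypothetical state-acts listing page; substitute the real one
    start_urls = ['https://www.indiacode.nic.in/state-acts']

    def parse(self, response):
        # Same pattern: collect the title and document URL for each listed act
        for link in response.css('td.title a'):  # hypothetical selector
            yield {
                'title': link.css('::text').get(),
                'url': response.urljoin(link.attrib['href']),
            }
        next_page = response.css('a.next::attr(href)').get()  # hypothetical selector
        if next_page:
            yield response.follow(next_page, callback=self.parse)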
We can modify the run script to crawl both:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from india_code.spiders.central_acts import CentralActsSpider as SpiderCA
from india_code.spiders.state_acts import StateActsSpider as SpiderSA

def run_spiders():
    process = CrawlerProcess(get_project_settings())
    process.crawl(SpiderCA)
    process.crawl(SpiderSA)
    process.start()

if __name__ == "__main__":
    run_spiders()
And that's it! With this, we're through with part 1. Of course, we'll have to run crawl_script.py for a couple of hours to collect all the data.
II. Retrieval-Augmented Generation
The previous blog on LangChain will help here. LLMs are frozen in time, especially when we don't intend to fine-tune them on our data. And we can't go on retraining our model every time a new document is added. So how do we incorporate new information into our model? This is where retrieval augmentation steps in. This technique allows us to retrieve relevant information from an external knowledge base and provide it to our LLM. Thus, we will first need to create our knowledge base, and then use it to enable our LLM to answer questions with the information we retrieve.
Creating our Knowledge Base
In the LLM world we use vector databases to store our data. As the name suggests, vector databases are simply databases storing vector embeddings, which allow us to scale our search for similar embeddings to billions of records. We'll be using Pinecone as our database. So how do we create our vector embeddings? This is the subject of our next section.
From Documents to Vector Embeddings
First, we need to instantiate an object to split lengthy, verbose sections into bite-sized chunks so our LLM can be effective. We can use LangChain's RecursiveCharacterTextSplitter to this end. We use OpenAI embeddings to translate our chunks into vector embeddings. We also create a vector index on Pinecone and copy the API key over to our environment.
import os
import tiktoken
import pinecone
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings

def tiktoken_len(text):
    # Measure chunk length in tokens rather than characters
    tokenizer = tiktoken.get_encoding('cl100k_base')
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)

# Keys copied into the environment earlier
PINECONE_API_KEY = os.environ['PINECONE_API_KEY']
PINECONE_ENV = os.environ['PINECONE_ENV']

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENV
)
index = pinecone.Index('law-search')

embed = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
We'll aim to push vectors in batches, to avoid overwhelming the Pinecone database with too many tokens in one request. This step might be a tad expensive, so tread carefully. Since we simply wanted to build a proof of concept, I used a small subset (<1%) of all the documents we had collated.
import pandas as pd
from tqdm import tqdm
from uuid import uuid4
from langchain.document_loaders import PyPDFLoader

def load_chunk_persist_pdf(constitution_path='india_code/data/constitution.csv'):
    batch_limit = 64
    texts = []
    metadatas = []
    df = pd.read_csv(constitution_path)
    for i in tqdm(range(df.shape[0])):
        # Load the PDF linked in the second column and join its pages
        loader = PyPDFLoader(df.iloc[i, 1])
        contents = "".join(page.page_content for page in loader.load())
        record_texts = text_splitter.split_text(contents)
        record_metadatas = [{
            "chunk": j, "text": text, "source": f'{df.iloc[i, 0]}, part {j}'
        } for j, text in enumerate(record_texts)]
        texts.extend(record_texts)
        metadatas.extend(record_metadatas)
        # Once we've accumulated a full batch, embed and upsert it
        if len(texts) >= batch_limit:
            ids = [str(uuid4()) for _ in range(len(texts))]
            embeds = embed.embed_documents(texts)
            for k in range(0, len(ids), batch_limit):
                stop_index = min(k + batch_limit, len(ids))
                index.upsert(vectors=zip(ids[k:stop_index], embeds[k:stop_index], metadatas[k:stop_index]))
            texts = []
            metadatas = []
    # Flush any leftovers smaller than a full batch
    if texts:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
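With the splitter, embeddings, and index defined above, ingesting the subset is then a single call:

load_chunk_persist_pdf()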
Generative Question Answering
Now this is one of the simplest sections. We'll use our vector index to fetch a number of similar embeddings from our database, and the LLM will then form these into a coherent answer to our query. The code for this is simple (we'll use a PromptTemplate to guardrail answers, shown after the snippet):
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

vectorstore = Pinecone(
    index,
    embed.embed_query,
    'text'  # metadata field that holds the chunk text
)

llm = ChatOpenAI()
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 15})
)
qa.run("<YOUR QUERY HERE>")
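The snippet above uses the chain's default prompt. To actually guardrail answers with a PromptTemplate, one option is to pass a custom prompt through chain_type_kwargs; the template wording here is just an assumption of what a guardrail might say:

from langchain.prompts import PromptTemplate

# Instruct the model to answer only from the retrieved context
template = """Answer the question using only the context below.
If the answer is not in the context, say you don't know; do not make one up.

Context: {context}

Question: {question}
Answer:"""

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 15}),
    chain_type_kwargs={
        "prompt": PromptTemplate(
            template=template,
            input_variables=["context", "question"],
        )
    },
)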
Conclusion
So now we have the tools and a basic mental framework for how RAG works. Clearly there's a ton we could still do, from tuning the "voodoo" constants to more complex experiments with model retraining and the works. What's important is that we've kickstarted our journey. And it's going to be really exciting to see what we build next.