CSCI 572: Information Retrieval and Web Search Engines - HW: LLMs, vectors, RAG

HW: LLMs, vectors, RAG :)

Summary

In this final HW, you will:

• use Weaviate [https://weaviate.io/], which is a vector DB: it stores data as vectors after vectorizing it, and answers a search query by vectorizing the query and doing a similarity search against the existing vectors

• crawl the web using a Node package, to compile a 'knowledge base' [to use subsequently (not part of the hw) as input to build a custom GPT (!)]

• using a Python module, perform RAG [retrieval-augmented generation] on a 'small', locally-hosted LLM [make that an 'S'LM :)]

• use https://lightning.ai to run RAG on their CPU+GPU platform

These are cutting-edge techniques to know, from a future/career POV :) Plus, they are simply FUN!!

Please make sure you have these installed before starting: git, Docker, Node, Python (or Conda/Anaconda), VS 2022 [with 'Desktop development with C++' checked].

Note: you need to do all four, Q1..Q4 (not pick just one!) :)

Q1.

Description

We are going to use vector-based similarity search, to retrieve search results that are not keyword-driven.

The (three) steps we need are really simple:

install Weaviate plus vectorizer via Docker as images, run them as containers

specify a schema for data, upload data/knowledge (in .json format) to have it be vectorized

run a query (which also gets vectorized and then sim-searched), get back results (as JSON)

The following sections describe the above steps.

1. Installing Weaviate and a vectorizer module

After installing Docker, bring it up (eg. on Windows, run Docker Desktop). Then, in your (ana)conda shell, run this docker-compose command, which uses this 'docker-compose.yml' config file to pull in two images: the 'weaviate' one, and a text2vec transformer called 't2v-transformers':

docker-compose up -d

These screenshots show the progress, completion, and subsequently, two containers automatically being started (one for weaviate, one for t2v-transformers):

2024/4/28 05:55

Yeay! Now we have the vectorizer transformer (to convert sentences to vectors), and weaviate (our vector DB search engine)

running! On to data handling :)

2. Loading data to search for

This is the data (knowledge, aka external memory, ie. prompt augmentation source) that we'd like searched, part of which will get returned to us as results. The data is represented as an array of JSON documents. Here is our data file, conveniently named data.json (you can rename it if you like) [you can visualize it better using https://jsoncrack.com] - place it in the 'root' directory of your webserver (see below). As you can see, each datum/'row'/JSON contains three k:v pairs, with 'Category', 'Question', 'Answer' as keys - as you might guess, it seems to be in Jeopardy(TM) answer-question (reversed) format :) The file is actually called jeopardy-tiny.json, I simply made a local copy called data.json.

The overall idea is this: we'd get the 10 documents vectorized, then specify a query word, eg. 'biology', and automagically have that

pull up related docs, eg. the 'DNA' one (even if the search result doesn't contain 'biology' in it)! This is a really useful semantic search

feature where we don't need to specify exact keywords to search for.
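To make the idea concrete, here is a minimal sketch of similarity search in plain Python. The three-dimensional vectors are hand-made stand-ins, not real embeddings (a vectorizer produces vectors with hundreds of dimensions), and the document names and the helper names are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query, documents):
    """Return document names sorted from most to least similar to the query."""
    return sorted(
        documents,
        key=lambda name: cosine_similarity(query, documents[name]),
        reverse=True,
    )

# Hypothetical, hand-made "embeddings"; axes loosely mean (science, animals, geography).
docs = {
    "DNA": [0.9, 0.3, 0.0],
    "Elephant": [0.2, 0.9, 0.1],
    "Sound barrier": [0.8, 0.0, 0.2],
}
query_vector = [1.0, 0.4, 0.0]  # made-up vector for the query word 'biology'

ranking = rank(query_vector, docs)
print(ranking)
```

Note that no string matching happens anywhere: the 'DNA' entry ranks first only because its vector points in nearly the same direction as the query vector.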

Start by installing the weaviate Python client:

pip install weaviate-client

So, how do we submit our JSON data, to get it vectorized? Simply run the Python loader script:

python weave-loadData.py

You will see this:

If you look in the script, you'll see that we are creating a schema - we create a class called 'SimSearch' (you can call it something else

if you like). The data we load into the DB, will be associated with this class (the last line in the script does this via add_data_object()).
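A loader along these lines might look as follows. This is a sketch, not the contents of weave-loadData.py: it assumes the v3 API of weaviate-client (client.schema.create_class and client.batch.add_data_object; v4 of the client uses a different API), the 'text2vec-transformers' vectorizer module name, and lower-cased property names. Check all of these against the actual script:

```python
def property_name(key):
    """Lower-case the first letter of a data key (an assumed naming style)."""
    return key[0].lower() + key[1:]

def build_class_schema(class_name, keys):
    """Build a Weaviate class definition with one text property per data key."""
    return {
        "class": class_name,
        "vectorizer": "text2vec-transformers",  # assumed module name
        "properties": [
            {"name": property_name(key), "dataType": ["text"]} for key in keys
        ],
    }

def load_records(client, class_name, keys, records):
    """Create the class, then upload each record so that it gets vectorized.

    `client` is expected to be a v3 weaviate.Client("http://localhost:8080");
    it is passed in so that this sketch needs no imports.
    """
    client.schema.create_class(build_class_schema(class_name, keys))
    with client.batch as batch:
        for record in records:
            properties = {property_name(key): record[key] for key in keys}
            batch.add_data_object(properties, class_name)

schema = build_class_schema("SimSearch", ["Category", "Question", "Answer"])
print(schema)
```

Keeping the keys in one list means that swapping in your own triple of keys later only touches a single line.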

NOTE - you NEED to run a local webserver [in a separate ana/conda (or other) shell], eg. via python 'serveit.py' - it's what will 'serve' data.json to weaviate :)
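The webserver script is not reproduced here; a minimal stand-in using only Python's standard library might look like this. The port number (8000) and the helper name make_server are assumptions, so match them to whatever URL the loader script fetches data.json from:

```python
import functools
import http.server

def make_server(directory=".", port=8000):
    """Create an HTTP server that serves files (eg. data.json) from `directory`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # An empty host string binds to all local interfaces.
    return http.server.ThreadingHTTPServer(("", port), handler)

# To serve the current directory until interrupted with Ctrl-C:
#     make_server(".", 8000).serve_forever()
```

Running 'python -m http.server 8000' from the directory that holds data.json gives the same result without any script.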

Great! Now we have specified our searchable data, which has been first vectorized (by 't2v-transformers'), then stored as vectors (in weaviate).

Only one thing left: querying!

3. Querying our vectorized data

To query, use this simple shell script called weave-doQuery.sh, and run this:

sh weave-doQuery.sh

As you can see in the script, we search for 'physics'-related docs, and sure enough, that's what we get:

Why is this exciting? Because the word 'physics' isn't in any of our results!
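The query script itself is not reproduced here. As a rough Python equivalent, the sketch below builds a GraphQL 'nearText' query and posts it to Weaviate's /v1/graphql endpoint. The class name, property names, result limit, and helper names are assumptions for illustration:

```python
import json
import urllib.request

def build_near_text_query(class_name, concepts, fields, limit=2):
    """Return a GraphQL query string asking for objects near the given concepts."""
    concept_list = json.dumps(concepts)  # eg. ["physics"]
    field_block = " ".join(fields)
    return (
        "{ Get { %s(nearText: {concepts: %s}, limit: %d) "
        "{ %s _additional { certainty } } } }"
        % (class_name, concept_list, limit, field_block)
    )

def run_query(query, url="http://localhost:8080/v1/graphql"):
    """POST the query to a running Weaviate instance and return the parsed JSON."""
    payload = json.dumps({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

query = build_near_text_query(
    "SimSearch", ["physics"], ["category", "question", "answer"]
)
print(query)
# With the two containers running:
#     print(json.dumps(run_query(query), indent=2))
```

Because the concepts argument is a list, a multi-item query such as ['Indian', 'Chinese'] needs no change to the code.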

Now it's your turn:

• first, MODIFY the contents of data.json, to replace the 10 docs in it with your own data, where you'd replace ("Category","Question","Answer") with ANYTHING you like, eg. ("Author","Book","Summary"), ("MusicGenre","SongTitle","Artist"), ("School","CourseName","CourseDesc"), etc, etc - HAVE fun coming up with this! You can certainly add more docs, eg. have 20 of them instead of 10

• next, MODIFY the query keyword(s) in the query .sh file - eg. you can query for 'computer science' courses, 'female' singer, 'American' books, ['Indian','Chinese'] food dishes (the query list can contain multiple items), etc. Like in the above screenshot, 'cat' the query, then run it, and get a screenshot to submit. BE SURE to also modify the data loader .py script, to put in your keys (instead of ("Category","Question","Answer"))

That's it, you're done :) In RL you will have a .json or .csv file (or data in other formats) with BILLIONS of items! Later, do feel free to play with bigger JSON files, eg. this 200K Jeopardy JSON file :)

FYI/'extras'

Here are two more things you can do, via 'curl' [you can also open these URLs in your browser]:

http://localhost:8080/v1/meta

http://localhost:8080/v1/schema

HW: LLMs!

https://bytes.usc.edu/cs572/s24-s-e-a-r-c-hhh/hw/HW4/index.html

Weaviate has a cloud version too, called WCS - you can try that as an alternative to using the Dockerized version:

Run this :)

Also, for fun, see if you can print the raw vectors for the data (the 10 docs)...
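One way to approach this, sketched under the assumption that you query through GraphQL: request the '_additional { vector }' field for each object, then pull the vectors out of the JSON response. The response below is a hand-written stand-in with shortened, made-up vectors; the class name and the property name 'question' are likewise assumptions:

```python
def extract_vectors(response, class_name, label_field):
    """Map each returned object's label to its raw vector.

    Expects a Weaviate GraphQL response whose query requested
    `_additional { vector }` for every object.
    """
    objects = response["data"]["Get"][class_name]
    return {obj[label_field]: obj["_additional"]["vector"] for obj in objects}

# Hand-written stand-in for a real response (real vectors are far longer).
sample_response = {
    "data": {"Get": {"SimSearch": [
        {"question": "doc one", "_additional": {"vector": [0.12, -0.03, 0.44]}},
        {"question": "doc two", "_additional": {"vector": [-0.21, 0.08, 0.10]}},
    ]}}
}

vectors = extract_vectors(sample_response, "SimSearch", "question")
for label, vector in vectors.items():
    print(label, len(vector), vector)
```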

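More info:

• https://weaviate.io/developers/weaviate/quickstart/end-to-end

• https://weaviate.io/developers/weaviate/installation/docker-compose

• https://medium.com/semi-technologies/what-weaviate-users-should-know-about-docker-containers-1601c6afa079

• https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-transformers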

Q2.

You are going to run a crawler on a set of pages that you know contain 'good' data - that could be used by an LLM to answer questions 'intelligently' (ie. not confabulate, ie. not 'hallucinate', ie. not make up BS based on its core, general-purpose pre-training!).

The crawled results get conveniently packaged into a single output.json file. For this qn, please specify what group of pages you crawled [you can pick any that you like], and submit your output.json (see below for how to generate it).

Take a look:

You'll need to git-clone 'gpt-crawler' from https://github.com/BuilderIO/gpt-crawler. Then do 'npm install' to download the needed Node packages. Then edit config.ts [https://github.com/BuilderIO/gpt-crawler/blob/main/config.ts] to specify your crawl path, then simply run the crawler via 'npm start'! Voila - a resulting output.json, after the crawling is completed.

For this hw, you'll simply submit your output.json - but its true purpose is to serve as input for a custom GPT :)
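Before submitting, it can help to sanity-check what the crawler produced. The sketch below assumes output.json is a JSON array of objects with 'title', 'url', and 'html' keys; verify that against your own file. The sample pages and helper names are made up for illustration:

```python
import json

def load_crawl(path="output.json"):
    """Read the crawler's output file into a list of page dicts."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

def summarize_crawl(pages):
    """Return (page count, total characters of page text)."""
    total_chars = sum(len(page.get("html", "")) for page in pages)
    return len(pages), total_chars

# Made-up pages in the assumed {title, url, html} shape.
sample_pages = [
    {"title": "Page A", "url": "https://example.com/a", "html": "first page text"},
    {"title": "Page B", "url": "https://example.com/b", "html": "second page"},
]
count, chars = summarize_crawl(sample_pages)
print(count, chars)
# On a real crawl: print(summarize_crawl(load_crawl("output.json")))
```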

From builder.io's GitHub page:

Amazing! You can use this to create all sorts of SMEs [subject matter experts] in the future, by simply scraping existing docs on the

web.

Q3.

For this question, you are going to download a small (3.56G) model (with 7B parameters, compared to GPT-4's 1T for ex!), and use it along with an external knowledge source (a simple text file) vectorized using Chroma (a popular vector DB), and ask questions whose answers would be found in the text file :) Fun!
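The retrieve-then-augment idea can be sketched without any model at all. The example below is a toy: it scores text chunks by word overlap instead of using Chroma's embedding search, and it only builds the augmented prompt rather than calling an LLM. The knowledge text and the function names are made up for illustration:

```python
import re

def tokenize(text):
    """Lower-case a string and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, chunks, top_k=1):
    """Return the top_k chunks sharing the most words with the question."""
    question_words = tokenize(question)
    scored = sorted(
        chunks,
        key=lambda chunk: len(question_words & tokenize(chunk)),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(question, context_chunks):
    """Prepend the retrieved context to the question (the 'augmentation' step)."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Made-up knowledge source, standing in for the text file.
knowledge = [
    "The library opens at 9 am on weekdays.",
    "The cafeteria serves lunch from noon to 2 pm.",
    "Parking permits are sold at the front office.",
]
question = "When does the library open?"
prompt = build_prompt(question, retrieve(question, knowledge))
print(prompt)
```

In the real pipeline the overlap score is replaced by vector similarity, and the prompt is sent to the locally-hosted model, which then answers from the supplied context rather than from its pre-training alone.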

git clone this: - and cd into it. You'll see a Python script (app.py) and a requirements.txt file.