This project shows how to pair OpenAI with your own knowledge base or documents. The nice part: you get semantic search over the indexed content, and you can tweak the prompts that shape the LLM's generated responses just the way you like.
This project contains a Streamlit Chat interface and a Luigi ETL Pipeline that processes and stores documents into a Weaviate Vectorstore instance.
The ETL pipeline performs several tasks: converting Jupyter notebooks and Python scripts to Markdown format, cleaning the code blocks in the Markdown files, removing unnecessary files and directories, and uploading the processed data to a Weaviate Instance.
Before you start, ensure that you have Python 3.11 or later installed.
- Clone the repository:

```shell
git clone https://github.com/josoroma/etl-pipeline-for-langchain-docs
```
- Set up the project by installing the dependencies with the `setup` directive. This creates a virtual environment and installs the necessary packages specified in `pyproject.toml`.
Environment Variables (.env)
Before running the application, set the following environment variables:
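The exact variable names live in the project's `.env`; based on the services this project talks to (OpenAI and Weaviate), a plausible sketch looks like the following. Note the names and values here are assumptions for illustration, not copied from the repository:

```shell
# Hypothetical .env sketch - variable names are assumptions based on the
# services used by this project, not taken from the repository itself.
OPENAI_API_KEY=your-openai-api-key        # used by the vectorizer and generative module
WEAVIATE_URL=http://localhost:8080        # address of the Weaviate instance
WEAVIATE_API_KEY=your-weaviate-api-key    # only if your instance requires auth
```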
Running the ETL Pipeline
You can start the ETL pipeline with the `run-etl-orchestrator` directive. This command launches the Luigi pipeline with the orchestrator task:

```shell
poetry run python -m luigi --module orchestrator Orchestrator
```
To monitor the progress of the ETL pipeline, open the Luigi visualizer with the `run-etl-visualizer` directive. This command opens the visualizer interface in a web browser (by default, Luigi's central scheduler serves it at http://localhost:8082).
Luigi Tasks (ETL Orchestrator)
The ETL pipeline consists of several Luigi tasks:
- `BuildDataLake`: Downloads data from a repository and builds a data lake.
- `ConvertIPYNBtoMyst`: Converts Jupyter notebooks to MyST format.
- `ConvertPytoMd`: Converts Python files to Markdown format.
- `DeleteNonMdFiles`: Deletes files that are not in Markdown format.
- `RemoveEmptyDirectories`: Removes empty directories.
- `CleanupCode`: Cleans up the code in the Markdown files.
- `RemoveJupyterText`: Removes any Jupyter-specific text from the Markdown files.
- `Upsert`: Uploads the processed data to a Weaviate Vectorstore instance.
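To give a feel for what a conversion step does, here is a minimal sketch of the `ConvertPytoMd` idea: wrapping a Python script's source in a Markdown code fence. The function name and exact output format are illustrative, not the project's actual implementation:

```python
FENCE = "`" * 3  # Markdown code-fence delimiter, built up to avoid fence-nesting issues


def py_to_md(source: str, filename: str) -> str:
    """Render a Python script as a Markdown document.

    A simplified sketch of what a ConvertPytoMd-style task might do;
    the real pipeline may also clean code blocks or add front matter.
    """
    # Heading from the original filename, then the source in a python fence.
    return f"# {filename}\n\n{FENCE}python\n{source.rstrip()}\n{FENCE}\n"
```

In the real pipeline each generated Markdown file would then flow into the cleanup tasks before being upserted.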
These tasks are orchestrated by the `OrchestratorTask` in the `orchestrator` module (the module passed via `--module orchestrator` in the command above).
Running the Chat Interface
You can run the chat interface with the `run-chat` directive. This command starts the Streamlit chat application.
The configuration for the Weaviate class and schema is defined in the `weaviate_config.py` script. The Weaviate class is named `Document`, and it uses the `text2vec-openai` vectorizer and the `generative-openai` module with the `gpt-4` model. The class has a single property, `content`, of data type `text`.
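For reference, the class definition described above could be expressed as a Weaviate schema dictionary along these lines. This is a hand-written reconstruction; `weaviate_config.py` is the authoritative version, and the exact `moduleConfig` keys may differ:

```python
# Hypothetical reconstruction of the schema described in weaviate_config.py;
# the actual file is authoritative and may structure moduleConfig differently.
DOCUMENT_CLASS = {
    "class": "Document",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "generative-openai": {"model": "gpt-4"},
    },
    "properties": [
        {"name": "content", "dataType": ["text"]},
    ],
}
```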
We welcome contributions to this project! If you have insights, feedback, or code changes you'd like to share, please fork the repository and create a pull request, or open an issue for discussion.
This project is licensed under the terms of the MIT license: