# Profiling

A user-profiling pipeline: reads data from SQL Server, computes deterministic signals and stats, optionally runs LLM deep analysis, and serves results through a FastAPI surface.

## Prerequisites

- **Python 3.12+**
- **[ODBC Driver 18 for SQL Server](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server)** installed on the host
- **uv** (optional; recommended) — or use **stdlib `venv` + `pip`** (see **Setup (venv + pip, without uv)** below)

Install **uv** (if you use it):

- **Ubuntu / Linux / macOS (shell):** `curl -LsSf https://astral.sh/uv/install.sh | sh`
- **Windows (PowerShell):** `powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"`

## Setup (uv + venv)

From the repository root:

```bash
uv venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
uv sync
```

You can also run commands without activating the venv: `uv run python …` uses the project environment automatically.

## Setup (venv + pip, without uv)

From the repository root, using only the standard library and pip:

```bash
python3.12 -m venv .venv
source .venv/bin/activate
```

**Windows:** `python -m venv .venv` then either `.venv\Scripts\activate.bat` (cmd) or `.venv\Scripts\Activate.ps1` (PowerShell).

Then install dependencies from `pyproject.toml` and run with `python` (not `uv run`):

```bash
pip install -U pip
pip install -e .
```

## Environment (`.env`)

Create a **`.env`** file in the project root (same directory as `main.py`). The app loads it via `python-dotenv`.

**Database (required)**

| Variable | Description |
|----------|-------------|
| `DB_SERVER` | SQL Server host |
| `DB_NAME` | Database name |
| `DB_USER` | Login |
| `DB_PASSWORD` | Password |
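These four variables are typically assembled into an ODBC connection string. A minimal sketch, assuming a pyodbc-style string and the default driver (the project's actual DB layer may format this differently):

```python
import os

def build_conn_str(env=os.environ):
    """Assemble an ODBC connection string from the required DB_* variables.

    Mirrors the usual pyodbc pattern; illustrative only.
    """
    driver = env.get("DB_DRIVER", "{ODBC Driver 18 for SQL Server}")
    return (
        f"DRIVER={driver};"
        f"SERVER={env['DB_SERVER']};"
        f"DATABASE={env['DB_NAME']};"
        f"UID={env['DB_USER']};"
        f"PWD={env['DB_PASSWORD']}"
    )

print(build_conn_str({
    "DB_SERVER": "your-sql-host",
    "DB_NAME": "your_database",
    "DB_USER": "your_user",
    "DB_PASSWORD": "your_password",
}))
```

Note that ODBC Driver 18 encrypts by default, so depending on your server's certificate you may also need `Encrypt`/`TrustServerCertificate` keywords in the real string.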

**LLM (only when deep analysis runs — pick one path)**

Deep analysis calls an LLM; other paths do not need these keys.

- **Batch jobs:** LLM keys are required only if `with_deep_analysis` is `true` in `scheduler_config.py` (`BATCH_SCHEDULER`). Set it to `false` to run batch profiling without any LLM credentials.
- **`GET /profile/{user_id}/data`:** Builds JSON from the DB only — no LLM keys.
- **`GET /profile/{user_id}`:** Reads stored markdown from the DB — no LLM keys.
- **`POST /profile/{user_id}`:** Runs deep analysis — **requires** LLM credentials (see below).

**Provider keys (when deep analysis is enabled)**

- **Default:** set `OPENROUTER_API_KEY` (OpenRouter). Do not set `USE_OPENAI`, or set it to `false`.
- **OpenAI instead:** set `USE_OPENAI=true` and `OPENAI_API_KEY`.
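The selection rule above is a string-valued toggle plus a key lookup; since env vars are always strings, `USE_OPENAI=false` must not be treated as truthy. A hedged sketch of that logic (the app's real implementation lives in `core/llm_provider.py` and may differ):

```python
import os

def select_provider(env=os.environ):
    """Return (provider_name, api_key) per the env-var rules above.

    Illustrative only; see core/llm_provider.py for the real logic.
    """
    # Env vars are strings, so "false", "0", and "" must all count as disabled.
    use_openai = env.get("USE_OPENAI", "false").strip().lower() in ("1", "true", "yes")
    if use_openai:
        return "openai", env.get("OPENAI_API_KEY")
    return "openrouter", env.get("OPENROUTER_API_KEY")
```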

**Optional**

| Variable | Default / notes |
|----------|----------------|
| `DB_DRIVER` | `{ODBC Driver 18 for SQL Server}` |
| `DB_POOL_SIZE` | `7` |
| `OPENAI_MODEL_NAME` | `gpt-4o-mini` |
| `OPENROUTER_MODEL_NAME` | see `core/llm_provider.py` |
| `LOG_LEVEL` | `ERROR` (API console); use `INFO` or `DEBUG` for more logs |
| `LANGSMITH_API_KEY` | If set, enables LangSmith / LangChain tracing |
| `DEEP_ANALYSIS_MAX_TURNS` | Deep analysis agent turn limit (see `deep_analysis/agent.py`) |
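Optional integers like `DB_POOL_SIZE` and `DEEP_ANALYSIS_MAX_TURNS` arrive as strings and need parsing with a fallback. A sketch of the pattern, with defaults taken from the table (the app's actual handling may differ):

```python
import os

def env_int(name, default, env=os.environ):
    """Read an integer env var, falling back to the default on absence or junk."""
    try:
        return int(env.get(name, ""))
    except ValueError:
        return default

pool_size = env_int("DB_POOL_SIZE", 7)            # table default: 7
log_level = os.environ.get("LOG_LEVEL", "ERROR")  # table default: ERROR
```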

**Example** (replace placeholders; do not commit real secrets):

```env
# --- Database (required) ---
DB_SERVER=your-sql-host
DB_NAME=your_database
DB_USER=your_user
DB_PASSWORD=your_password

# --- LLM: OpenRouter (default path) ---
OPENROUTER_API_KEY=your_openrouter_key

# --- Or OpenAI instead (uncomment and fill) ---
# USE_OPENAI=true
# OPENAI_API_KEY=your_openai_key

# --- Optional ---
# DB_DRIVER={ODBC Driver 18 for SQL Server}
# DB_POOL_SIZE=7
# LOG_LEVEL=INFO
# LANGSMITH_API_KEY=
# DEEP_ANALYSIS_MAX_TURNS=1
```

## How to run

Activate your **venv** first (uv or pip setup above), then run one of:

| What | Command |
|------|---------|
| **HTTP API** (FastAPI) | `python main.py` — **port 5009**; OpenAPI at `/docs` |
| | or `uv run python main.py` (same; no venv activate needed if you use uv) |
| | or `uvicorn main:app --host 0.0.0.0 --port 5009` / `uv run uvicorn main:app --host 0.0.0.0 --port 5009` |
| | Production: `gunicorn main:app -k uvicorn.workers.UvicornWorker -w 4 -b 0.0.0.0:5009 --timeout 120` |
| **Batch profiling** | `python batch_scheduled_profiling.py` — optional `--dry-run`, `--fail-fast`; defaults in `scheduler_config.py` |
| | or `uv run python batch_scheduled_profiling.py` |
| | or `./scripts/run_batch_scheduled_profiling.sh` (same; always runs from repo root) |

### Cron (scheduled batch profiling)

The entrypoint is **`batch_scheduled_profiling.py`**. Cron does not load your shell profile, so use **absolute paths** and either **`uv`** on `PATH` in the crontab or a **venv interpreter**.

1. **Tune `scheduler_config.py`** (`BATCH_SCHEDULER`): e.g. `batch_size`, `with_deep_analysis`, `skip_recency_days`, publish flags. For **resume across long runs**, set `checkpoint_file` (JSON with `last_success_user_id`), e.g. `var/batch_checkpoint.json` (create `var/` on the server; it is gitignored).
2. **Ensure `.env`** exists in the repo root (same variables as the API — DB required; LLM keys only if `with_deep_analysis` is true).
3. **Install a crontab line**, for example daily at 02:00 server time:

```cron
0 2 * * * /home/deploy/Eatance-User-Profiling/scripts/run_batch_scheduled_profiling.sh >> /var/log/eatance-batch-profiling.log 2>&1
```

If one run can overlap the next, wrap with **`flock`** so only one batch runs at a time:

```cron
0 2 * * * flock -n /tmp/eatance-batch-profiling.lock /home/deploy/Eatance-User-Profiling/scripts/run_batch_scheduled_profiling.sh >> /var/log/eatance-batch-profiling.log 2>&1
```

4. **Smoke-test** before relying on cron:

```bash
./scripts/run_batch_scheduled_profiling.sh --dry-run
```

