Running Tabular Cleaning Baselines

Raha

https://github.com/BigDaMa/raha

PClean

git clone https://github.com/probcomp/PClean

Change into the repository directory.

cd PClean

Activate the project and install dependencies.

# Start Julia REPL
julia

# Activate the project (assuming you're in the PClean directory)
using Pkg
Pkg.activate(".")

# Install all dependencies
Pkg.instantiate()

Run the hospital experiment.

JULIA_PROJECT=. julia experiments/hospital/run.jl

Cocoon

Running the Cocoon tutorial
https://github.com/Cocoon-Data-Transformation/cocoon

Garf

https://github.com/PJinfeng/Garf-master

Garf requires an Oracle database.

HoloClean

Git: https://github.com/HoloClean/holoclean

Related Info

System Version:

Operating System: Ubuntu 24.04 LTS
Kernel: Linux 6.8.0-41-generic

0. Prerequisite: start a PostgreSQL container

docker run --name pghc \
-e POSTGRES_DB=holo -e POSTGRES_USER=holocleanuser -e POSTGRES_PASSWORD=abcd1234 \
-p 5432:5432 \
-d postgres:11
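
Before moving on, it can help to confirm that the container is accepting connections. The following is a minimal sketch using only the standard library and the values from the docker run command above. The helper names postgres_url and port_is_open are illustrative and not part of HoloClean, and the host is assumed to be the local machine.

```python
import socket


def postgres_url(user, password, host, port, dbname):
    """Build a SQLAlchemy-style PostgreSQL URL from the container settings."""
    return f"postgresql://{user}:{password}@{host}:{port}/{dbname}"


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all as "not reachable".
        return False
```

Calling port_is_open("localhost", 5432) after the container starts should return True once PostgreSQL is listening. A successful TCP connection only shows that the port is reachable; it does not verify the user name or password.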

1. Download code

git clone https://github.com/HoloClean/holoclean

2. Create and activate a conda environment

conda create -n hc36 python=3.6
conda activate hc36

3. Install dependencies

pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

4. Test


Questions:

Q1:

Error when installing python-Levenshtein:

  • Ubuntu/Debian System:
    sudo apt-get install build-essential python3-dev
  • CentOS/Fedora System:
    yum groupinstall 'Development Tools'
    yum install python3-devel

Q2:

Error Message:

File "python3.6/site-packages/smart_open/s3.py", line 9
from __future__ import annotations
^
SyntaxError: future feature annotations is not defined

Cause:
The annotations future import only exists in Python 3.7 and later, so newer smart_open releases cannot be imported under Python 3.6.

Downgrade smart_open to an older release:

pip install smart_open==3.0.0
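
To see why the import fails, one can check whether the running interpreter knows the annotations future feature. This is a small sketch using the standard __future__ module. The function name supports_future_annotations is illustrative.

```python
import __future__
import sys


def supports_future_annotations(version_info=None):
    """Return True if `from __future__ import annotations` is available.

    With no argument, inspect the running interpreter. Otherwise compare a
    (major, minor) tuple against 3.7, the first release with the feature.
    """
    if version_info is None:
        return hasattr(__future__, "annotations")
    return tuple(version_info[:2]) >= (3, 7)
```

Inside the hc36 environment, supports_future_annotations() is expected to return False, which matches the SyntaxError shown above.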

RTClean

https://github.com/delgaudl/RTClean

Download

git clone https://github.com/delgaudl/RTClean
git clone https://github.com/HoloClean/holoclean

1. Create conda env

conda create -n RTC38 python=3.8
conda activate RTC38

2. Modify requirements.txt

gensim==3.8.3
numpy==1.19.5
pandas==1.1.5
psycopg2-binary==2.9.1
pyitlib==0.2.3
pytest-xdist==3.6.1
python-Levenshtein==0.12.2
scikit-learn==0.24.0
scipy==1.5.4
sqlalchemy==1.3.24
torch==1.7.1
tqdm==4.50.2
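
After installation, the pins can be compared against what pip actually installed. The following is a rough sketch, assuming Python 3.8 or later (where importlib.metadata is in the standard library). The helper names parse_pins and find_mismatches are illustrative.

```python
from importlib import metadata  # standard library from Python 3.8


def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, lookup=metadata.version):
    """Return {name: (wanted, installed_or_None)} for pins that do not match."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

Reading requirements.txt, passing its text to parse_pins, and then calling find_mismatches on the result gives an empty dict when every pinned package is installed at the requested version.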

If you are behind a proxy, you may need to set these variables first:

export https_proxy=http://127.0.0.1:7890
export http_proxy=http://127.0.0.1:7890

Then install the dependencies:

pip install -r requirements.txt

3. Modify Code

Error Message:

    tic = time.clock()
AttributeError: module 'time' has no attribute 'clock'

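Replace every call to time.clock() with time.time(). time.clock() was removed in Python 3.8, which is why the call fails in this environment.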

4. Test holoclean

examples/holoclean_repair_example.py

5. Install extra requirements

pip install rdflib pyfuseki

https://www.hardyhu.cn/2025/02/24/Running-Tablular-Cleaning-Baselines/
Author
John Doe
Posted on
February 24, 2025